Dealing with problems on a "best effort" basis only works if you are made aware of the problem in the first place. I bring this up because I had to go in today to fix a problem with one of our backup servers: it had started late yesterday afternoon and went undetected for almost 24 hours! Had our monitoring worked properly, someone would have been called within an hour of the problem starting. But somewhere along the way, the step that gets the alert to the right person did not happen.
The cause turned out to be a power outage in the room where the server is located. The server itself is on a UPS, but the terminal used as its console is not. So when the power goes out, the terminal shuts down, and if the server's key switch is not in the locked position, the server acts as if it had received a break signal and drops to the "ok>" prompt (this will make sense to people familiar with Sun servers). Suffice it to say that the server is basically stopped while it sits at the "ok>" prompt. All I needed to do was type "go" and the server resumed processing!
I ended up spending most of the day away from home: church from 10:30 to 15:00 (service and meeting), and then 3.5 hours at work.
Good night!