Monday, August 15, 2005
Shooting ourselves in the foot?

It would appear that the problem with the backups for the last couple of days stems from two things: the size of the environment and the configuration settings. A scheduler gets started every 20 minutes to run scheduled backups. When it starts, it verifies that it can talk to every backup client before running the jobs. Normally that is not a problem. However, if it cannot talk to a backup client, it waits for the connect timeout to expire before moving on to the next client, and that is where the trouble starts.

We currently have just over 800 clients in this particular backup environment. Looking at the logs, it would appear to take about 5 seconds to communicate with each client. So if everything works perfectly, it takes about 66 minutes to go through the complete list of clients. Each client that we cannot reach adds a full connect timeout on top of that, and with the connect timeout set to 20 minutes, it does not take many unreachable clients before you have a real problem. In our case last night, we had 9 clients that were unreachable: that's 180 minutes! Therefore, if the scheduler received the list of jobs to start at 7pm, it would be well after 11pm before those jobs would start.

I have reduced the connect timeout to 5 minutes, which is the default for NetBackup. At least now it will be less than 2 hours from submission to running if the same problem occurs again.
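
For anyone who wants to check the arithmetic, here is a rough Python sketch of the delay. It is only a back-of-the-envelope model, not anything taken from NetBackup itself: the function name and parameters are made up for illustration, and it assumes exactly 800 clients, about 5 seconds per reachable client, clients checked strictly one after another, and a full connect timeout spent on every unreachable client.

```python
def preflight_delay_minutes(total_clients, unreachable,
                            per_client_seconds, connect_timeout_minutes):
    """Estimate the minutes spent checking clients before any job starts.

    Hypothetical model: clients are contacted one at a time, reachable
    clients answer in per_client_seconds, and each unreachable client
    costs the full connect timeout.
    """
    reachable = total_clients - unreachable
    check_seconds = reachable * per_client_seconds
    timeout_seconds = unreachable * connect_timeout_minutes * 60
    return (check_seconds + timeout_seconds) / 60


# Last night's situation: 9 unreachable clients, 20-minute connect timeout.
print(round(preflight_delay_minutes(800, 9, 5, 20), 1))  # 245.9 (just over 4 hours)

# Same 9 unreachable clients with the timeout reduced to 5 minutes.
print(round(preflight_delay_minutes(800, 9, 5, 5), 1))   # 110.9 (under 2 hours)
```

The model makes it plain that the timeout term dominates: every unreachable client costs as much as 240 healthy ones at the old 20-minute setting, and still 60 at the new one.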

 
posted by Christian Thibodeau at 12:53 PM

