WHEN to put the contingency plan into action in case of a main server failure?

Question

We have a production SQL Server database server shipping transaction log backups to two standby servers. The disaster recovery plan is already finished: we have a well-documented procedure and people trained to put the standby server into production, initiate replication, enable the jobs, etc., with minimal downtime.

The problem under discussion is not the contingency plan itself, but the DECISION to put the standby server into production and lose, in the worst case, 12 minutes of information (the transaction log backup runs every 10 minutes and is copied to the other servers very quickly).

The decision could be difficult because we can waste time trying to identify the problem. On the other hand, the problem could be simple to resolve and we could put the server back into production without using the other servers.

We understand that the situation will become very stressful in the event of a system failure, and we think that in these situations, it is better to have a standard procedure and a minimum of decisions.

So, we have a dilemma. Is it better to just change servers when something goes wrong with the main server, or better to try to identify and resolve the problem on the main server? What do you guys think about this?

Kyle Brandt · Accepted Answer · 2010-08-21T12:56:24.250

One framework you might want to use is two time windows for making this decision at the time of the problem. The end of the first window is a soft limit, and the end of the second is a hard limit for switching over.

The soft limit is the first possible cutover point. If you have been trying to solve the problem but are no closer to solving it than when you started, you switch at the soft limit. If you think you are getting close to solving the problem at the soft limit, you keep going until the hard limit. So the soft limit might be 5 minutes, for example, and the hard limit maybe 8 minutes from the start of trying to fix the problem. At the hard limit, you switch over no matter what.

You will have to decide the length of the windows for yourself. You also have to figure out whether they should include the time that passes before you actually start looking at the problem.
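To make the two-window rule concrete, here is a minimal Python sketch of it. The function name, the `Action` enum, and the `making_progress` flag are made up for illustration, and the 5 and 8 minute defaults are just the example values from above, not recommendations. Whether `elapsed_min` counts from the failure or from when someone starts looking is the choice you have to make yourself.

```python
from enum import Enum


class Action(Enum):
    KEEP_FIXING = "keep fixing the main server"
    SWITCH = "switch to the standby server"


def failover_decision(elapsed_min, making_progress,
                      soft_limit_min=5.0, hard_limit_min=8.0):
    """Apply the soft/hard limit rule (hypothetical helper).

    elapsed_min: minutes since the clock started (your own definition).
    making_progress: human judgment call, True if a fix looks close.
    """
    if soft_limit_min > hard_limit_min:
        raise ValueError("soft limit must not exceed hard limit")
    # Hard limit: switch no matter what.
    if elapsed_min >= hard_limit_min:
        return Action.SWITCH
    # Soft limit: switch only if you are no closer to a fix.
    if elapsed_min >= soft_limit_min and not making_progress:
        return Action.SWITCH
    return Action.KEEP_FIXING
```

For example, `failover_decision(6, making_progress=True)` says to keep fixing, while the same call at 8 minutes says to switch. The only judgment left during the incident is the `making_progress` flag, which keeps the number of decisions under stress small.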

You also could of course just wing it and do what you think is best at the time -- it is likely okay not to plan every last little detail.

score 3 · Answer 2 · answered Aug 20 '10 at 20:33

It's all about costs. What does it cost to try and fix the problem for X minutes/hours? Is it less than the cost of switching to a backup server, losing some data, and eventually moving back to the main production server?

Once the cost of trying to fix exceeds the cost of switching, the decision is made: switch. Until you have a handle on the costs, how can you define a "disaster"?
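As a rough illustration of that comparison, here is a small Python sketch. It assumes a deliberately simple model that is not from any real costing: downtime costs a constant amount per minute while you try to fix, and switching has one fixed cost covering lost data, the cutover, and the eventual move back. The function names and the numbers in the usage note are hypothetical.

```python
def break_even_minutes(downtime_cost_per_min, switch_cost):
    """Minutes of repair effort after which fixing costs more than switching.

    Assumes a linear downtime cost and a single fixed switch cost.
    """
    if downtime_cost_per_min <= 0:
        raise ValueError("downtime cost per minute must be positive")
    return switch_cost / downtime_cost_per_min


def should_switch(elapsed_min, downtime_cost_per_min, switch_cost):
    """True once the cost of trying to fix exceeds the cost of switching."""
    return elapsed_min * downtime_cost_per_min > switch_cost
```

With made-up figures of 200 per minute of downtime and 1600 for a switch, `break_even_minutes(200, 1600)` gives 8.0, so `should_switch(5, 200, 1600)` is False and `should_switch(9, 200, 1600)` is True. Real costs are rarely this linear, but even a crude estimate gives you a number to write into the procedure.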
