Questions tagged [disaster-recovery]

Disaster recovery and preparedness are an unfortunate aspect of systems administration. This tag should be used for help with planning, implementation and best practices related to recovering from a catastrophic event on a server or in a datacenter environment.

Recovering from an unplanned, catastrophic outage is a painful process whether you are managing a single server or an entire datacenter. Roof leaks, broken water lines, power outages and any number of other events can turn a great day into a living nightmare when you are responsible for keeping the systems that others rely on available.

The key to recovering from any disaster is preparedness. Knowing the steps required to bring the network and systems back online is critical. Before you can properly prepare for a disaster, you need to understand the risks, bottlenecks and other critical components of the overall system, e.g. who controls the power, internet connection, etc. at your site. Knowing which aspects of disaster recovery are within your control is an important part of planning; if nobody on staff can fix the power, HVAC, etc., make sure the contact information for someone who can is written down somewhere. Having this information to hand before a disaster occurs will help keep everyone calm, cool and on-task when something actually does happen.

Once the risks are assessed and a plan is created, print physical copies, email it out, and make sure everyone with admin-level access to the systems or datacenter has read it and is familiar with it. The best plan in the world is worthless if it lives only on a system that is down and cannot be restored without following the plan. After everyone is familiar with the plan, practice it when possible; a full rehearsal is often not realistic, but planned downtimes and unplanned outages are opportunities to walk through the recovery plan and refine it.

In summary, when a disaster happens:

  1. Don't panic! Panic turns a debacle into a catastrophe every time.
  2. Plan ahead, understand the risks, and know what is within your control.
  3. Follow the plan but be flexible; a recovery plan is more of a jazz tune than a military march.
  4. Stay calm and organized: use checklists and keep notes.
  5. If you are working in a team or group, communicate and collaborate.
  6. Be vigilant, and update your plan as the environment changes.
  7. Check your backups: make sure they happen at regular intervals and that the data they contain is still good.
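
The last item on the list can be partly automated. The following is a minimal sketch, not a definitive implementation: it assumes each backup is a single file, that a SHA-256 checksum was recorded when the backup was written, and that "regular intervals" means one backup per day. The names `check_backup`, `sha256_of` and `max_age_hours` are hypothetical and chosen only for illustration.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path, expected_sha256, max_age_hours=24, now=None):
    """Return a list of problems found; an empty list means no problem was detected.

    expected_sha256 is assumed to have been recorded when the backup was made.
    """
    path = Path(path)
    if not path.is_file():
        return [f"{path}: missing"]

    problems = []
    now = time.time() if now is None else now

    # A backup older than the expected interval suggests the job stopped running.
    age_hours = (now - path.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: stale ({age_hours:.1f} hours old)")

    # A changed checksum means the file was altered or corrupted after it was written.
    if sha256_of(path) != expected_sha256:
        problems.append(f"{path}: checksum mismatch")

    return problems
```

A matching checksum only shows that the file has not changed since the checksum was recorded. It does not prove the backup can be restored, so a periodic test restore is still needed.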
358 questions
3
votes
6 answers

Is there a good online backup solution for SBS2003?

I support an environment for a non-profit that is a single SBS 2003 server running Exchange and SharePoint. They are in need of a decent backup solution that I would prefer incorporate online backups, possibly with local copies as well using a…
3
votes
2 answers

How to start MSSQL Server with corrupt model db

After moving some databases around (restoring, deleting, etc.) we experienced an issue creating new databases. Specifically, when trying to create a new database, MSSQL Server failed because "The database 'model' is marked RESTORING and is in a…
Julia McGuigan
3
votes
2 answers

Mysqld InnoDB crash

My MySQL server just crashed and I cannot restart/recover it. I've tried: /etc/init.d/mysql restart Stopping MySQL database server: mysqld. Starting MySQL database server: mysqld . . . . . . . . . . . . . . failed! and mysqld --verbose…
Petr Peller
3
votes
5 answers

Licensing a SQL DR box

Possible Duplicate: Can you help me with my software licensing question? I'm tempted to use a DR server which holds a mirrored copy of a database to also perform test restores and integrity checks of the principal database. There may be issues…
3
votes
1 answer

Symantec Backup Exec recovery test of my Exchange server

I came into a new environment which runs pretty old systems: Windows Server 2003 with Exchange 2003. They are backing up servers with Symantec Backup Exec 11d. For compliance reasons, and so that I can sleep better, I would like to do a test restore of my…
server info
3
votes
2 answers

Recovering a Hyper-V vhd from a snapshot

I'm going to leave out tons of related info that I hope isn't relevant to keep this lean - feel free to ask for detail. My host is a Windows Server 2008 Standard SP2 (not R2). Last February we had created a snapshot on a virtual machine and then…
Kevin Donn
3
votes
2 answers

Has anyone set up a temporary webmail server to catch emails during an outage?

I've recently been in a situation where an Exchange email server went down and we weren't able to bring it back online after 48hrs, so some incoming emails bounced. Also, users weren't able to send or receive emails for a few days, which was a sore…
zippy
3
votes
1 answer

What is considered a disaster?

In doing research on a disaster recovery plan and trying to develop scenarios that must be accounted for, I realize that there are a number of different events that qualify as disasters. For example, all of these can be considered…
Steve Jones
3
votes
5 answers

How do you deal with the attitude that backups are not important?

I have noticed at my place of work that there is an attitude that backups are not very important (certainly, development/testing happens before any form of backup strategy is in place). Because the rest of my team are not system admins/lack system…
GurdeepS
3
votes
4 answers

How long does it take to rebuild a drive in a RAID 6?

I'm building a 7-disk RAID 6 array on a DELL MD3000 DAS box. My top priority is storage space, so I'd like to use 2TB drives -- but I'm worried about how long it will take to rebuild a failed disk. Is there a formula for figuring out how long a…
Jesse
3
votes
2 answers

How best to handle end user notification in the event of system failure incl. email?

I've been asked to research ways of handling end user notifications when systems such as email are experiencing problems. Perhaps an example will make this a little clearer. We have a number of sites in different countries. Recently email was…
Brian Lyttle
3
votes
1 answer

SQL Server differential backup: Simple vs Full recovery model

I need to better understand the backup process under SQL Server 2008. Since drive space is something of a concern for us and we want to have a better disaster recovery solution, I decided that we will implement differential backups throughout the day…
MaxiWheat
3
votes
6 answers

Wiki/CMS with synchronization?

We're looking into putting up a wiki or CMS for internal use by our IT department. One of the big things we want to use it for is disaster recovery procedures. Given that a disaster, such as a power or network outage, might render the wiki…
Clinton Blackmore
3
votes
2 answers

Can't Repair MySQL Table

I have one ARCHIVE table that I simply can't repair. I already tried to remove the partitioning but still get this error: alter table promo_tool_view_44 REMOVE PARTITIONING; ERROR 1034 (HY000): Incorrect key file for table 'promo_tool_view_44'; try to…
Pedro
3
votes
5 answers

Disk failed part way through 3ware RAID 5 rebuild

I have a 3ware 9650SE RAID controller with a RAID 5 array containing 15 Seagate ST31000340NS disks. After noticing ECC errors in the Port 10 drive I replaced it with a spare and began a RAID rebuild. Part way through the rebuild the Port 5 disk…
Dan