Questions tagged [disaster-recovery]

Disaster recovery and preparedness is an unfortunate aspect of systems administration. This tag should be used for help with planning, implementation and best practices related to recovering from a catastrophic event on a server or in a datacenter environment.

Recovering from an unplanned, catastrophic outage is a painful process whether you are managing a single server or an entire datacenter. Roof leaks, broken water lines, power outages and any number of other events can take what was a great day and turn it into a living nightmare when you are responsible for keeping systems others rely on available.

The key to recovering from any disaster is preparedness. Knowing the steps required to bring the network and systems back online is critical. Before one can properly prepare for a disaster it is necessary to understand the risks, bottlenecks and other critical components of the overall system, e.g. who controls the power, internet, etc. at your site. Understanding which aspects of disaster recovery are within one's control is very important when planning; if there is no one on staff who can fix the power, HVAC, etc., make sure that the contact info for someone who can is written down somewhere. Having this information available before a disaster occurs will help to keep everyone calm, cool and on-task when something actually does happen.

Once the risks are assessed and a plan is created, print out physical copies, email it, and make sure everyone with admin-level access to the systems/datacenter has read it and is familiar with it. The best plan in the world is worthless if it is on a system that is down and cannot be easily restored without following the plan. After everyone is familiar with the plan, practice when possible; in many situations a full rehearsal may not be realistic, but if possible take advantage of planned downtimes or natural outages to go through the recovery plan and refine it.

In summary, when a disaster happens:

  1. Don't Panic! Panic turns a debacle into a catastrophe every time.
  2. Plan ahead, understand the risks, and know what is within your control.
  3. Follow the plan but be flexible; a recovery plan is more of a jazz tune than a military march.
  4. Stay calm and organized, use checklists, keep notes.
  5. If you are working in a team or group, communicate and collaborate.
  6. Be vigilant, update your plan as the environment changes.
  7. Check your backups; make sure they happen at regular intervals and that the data contained therein is still good.
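
As an illustration of point 7, here is a minimal Python sketch of a backup check. It assumes each backup is a single file and that a SHA-256 digest was recorded when the backup was made. The function names, the 26-hour default age limit and the `now` parameter are hypothetical choices for this example, not part of any particular backup tool.

```python
import hashlib
import os
import time


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large backup files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path, expected_sha256, max_age_hours=26, now=None):
    """Return a list of problems found with one backup file (empty list = OK).

    `max_age_hours` and `now` are illustrative parameters: the first is the
    oldest acceptable backup age, the second lets callers override the clock.
    """
    if not os.path.isfile(path):
        return [f"missing: {path}"]
    problems = []
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        problems.append(f"stale: {path} is {age_hours:.1f} h old")
    if os.path.getsize(path) == 0:
        problems.append(f"empty: {path}")
    elif sha256_of(path) != expected_sha256:
        problems.append(f"checksum mismatch: {path}")
    return problems
```

A matching checksum only shows that the file has not changed since the digest was recorded. It does not prove that the backup can be restored, so a periodic test restore is still needed.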
358 questions
2
votes
1 answer

Recover ZFS data

Context/Hardware: HP MicroServer Gen8; 1x1TB (standalone), 2x4TB RAID, 1x16GB iLO SD card with Debian + OpenMediaVault. Event: SD card failure; restarted server and installed Ubuntu on 1TB drive. Consequences: ZFS not accessible…
Daniel Voina
  • 188
  • 1
  • 10
2
votes
5 answers

Do you maintain an OMG it's !@#$ed CD with Utilities to recover and clean computers? What is on it?

I'm getting ready to build one with some of the common apps I use, including Malwarebytes, Spybot, and memtest86. What else do you recommend I add? Is there a single CD image out there that has all these tools already that I can just burn?
Zak
  • 1,032
  • 2
  • 15
  • 25
2
votes
1 answer

Simulate power loss with force unmount?

I want to test disaster recovery of an RDBMS after power loss under high load. My idea is to mount the data directory under a new mountpoint and then execute umount -f during the load and investigate the outcome / state of files. My expectation is that with…
noonex
  • 248
  • 2
  • 10
2
votes
1 answer

Debian Lenny - SAN - LVM Fail

I've got a Lenny server that has a SAN connection configured as the only PV for a VG named 'datavg'. Yesterday, I updated the box with Debian patches and gave it a reboot. After the reboot, it didn't boot up, saying that it couldn't find…
Ger Apeldoorn
  • 565
  • 3
  • 10
2
votes
5 answers

Backup Linux to Windows Remote Site

We have two sites, one Linux (Ubuntu) based, the other Windows based, which we would like to connect permanently over VPN (using OpenVPN). We need to back up some files on a number of Linux (Ubuntu) servers to the Windows server at a remote site. I'm…
Mr Shoubs
  • 363
  • 2
  • 9
  • 32
2
votes
1 answer

MongoDB Active-Active Multiple Datacenters

I am looking for a method to configure MongoDB servers across 2 different datacenters where they remain in an active-active configuration. Site A is the normal Production environment that customers access and all writes are sent here, but all data…
Eroji
  • 203
  • 2
  • 5
  • 8
2
votes
2 answers

Keep the same IP and host name for new domain controller as crashed one

Because of a disaster I lost my primary domain controller; thankfully I had a secondary domain controller, so I could provide the service to the computer and save my data. After that I seized all 5 FSMO roles to the secondary…
2
votes
1 answer

Windows Server 2008 R2 - Keyboard isn't recognized on login screen, but mouse is

Problem Background: Our company has an IBM System X3200 M3 server running Windows Server 2008 R2 Enterprise with two WD Caviar Black 1TB HDs in RAID 1. Recently we had a power failure that corrupted the system boot, and since the previous IT guy…
2
votes
1 answer

Is a Wildcard cert better for Exchange DR than a SAN cert?

I'm reading this blog article about the logic of Autodiscover, and I hope I'm misreading it. The problem I have is that it appears that the Autodiscover process will inspect the SUBJECT of a certificate and use that to determine the most appropriate…
2
votes
4 answers

Windows Server backup / disaster recovery best practice for small shop?

I work for a small school that has two Windows 2003 servers. I'm trying to get a grip on cost-effective strategies to handle the failure of one or both servers. I come from a UNIX background, so I understand the various strategies and trade-offs I…
Paul Holbrook
  • 151
  • 1
  • 4
2
votes
2 answers

Missing VirtualDisk on bootup (RAID)

I have a Dell PowerEdge 2950 with DAS storage attached (MD1000). I rebooted the server to apply Windows updates (Win2008); on restart the BIOS detected that a VirtualDisk was missing and wanted me to continue or raid Config and import conf .... I continued…
Logman
  • 445
  • 2
  • 16
  • 28
2
votes
1 answer

How can I test my DR on Netapp without stopping replication?

I have two Netapp filers running ONTAP 7-mode with some volumes on the main one replicated to the secondary one on another site. I need to regularly test my DR, but I can't break the replication during the test. Sometimes they're quite long, and a…
Basil
  • 8,851
  • 3
  • 38
  • 73
2
votes
1 answer

Method for offsite backup of EC2 servers?

We run a dozen or so Ubuntu Linux webserver production instances on Amazon VPC. The instances are bootstrapped and managed via Puppet. Most management is done via the AWS Console. Our AWS credentials are pretty secure. The master-account is hardly…
Martijn Heemels
  • 7,728
  • 7
  • 40
  • 64
2
votes
1 answer

How to recover deleted root of linux system?

The story: I am using Ubuntu for my website. For a backup plan, I wrote a script. I wanted to mount an external drive (run through crontab), back up the server, and then umount it, every night. I noticed that through fdisk -l I got: /dev/sda …
Jackson
  • 41
  • 1
  • 3
2
votes
1 answer

SAN Synchronous Replication and SQL Server - Is RPO of 0 possible?

I am working on a solution that is using SQL Server 2012 SP2, but without the use of AlwaysOn availability groups. This is due to cross-database transactions, which do not work in this scenario. Note: This is being addressed as we speak, but…