Questions tagged [disaster-recovery]

Disaster recovery and preparedness is an unfortunate aspect of systems administration. This tag should be used for help with planning, implementation and best practices related to recovering from a catastrophic event on a server or in a datacenter environment.

Recovering from an unplanned, catastrophic outage is a painful process whether you are managing a single server or an entire datacenter. Roof leaks, broken water lines, power outages and any number of other events can take what was a great day and turn it into a living nightmare when you are responsible for keeping systems others rely on available.

The key to recovering from any disaster is preparedness. Knowing the steps required to bring the network and systems back online is critical. Before you can properly prepare for a disaster, you need to understand the risks, bottlenecks and other critical components of the overall system, e.g. who controls the power, internet, etc. at your site. Knowing which aspects of disaster recovery are within your control is an important part of planning; if nobody on staff can fix the power, HVAC, etc., make sure that the contact info for someone who can is written down somewhere. Having this information available before a disaster occurs will help keep everyone calm, cool and on task when something actually does happen.

Once the risks are assessed and a plan is created, print out physical copies, email it, and make sure everyone with admin-level access to the systems/datacenter has read it and is familiar with it. The best plan in the world is worthless if it is stored on a system that is down and cannot be easily restored without following the plan. After everyone is familiar with the plan, practice when possible; in many situations it may not be realistic, but if possible take advantage of planned downtimes or natural outages to go through the recovery plan and refine it.

In summary, when a disaster happens:

  1. Don't Panic! Panic turns a debacle into a catastrophe every time.
  2. Plan ahead, understand the risks, and know what is within your control.
  3. Follow the plan but be flexible; a recovery plan is more of a jazz tune than a military march.
  4. Stay calm and organized: use checklists and keep notes.
  5. If you are working in a team or group, communicate and collaborate.
  6. Be vigilant; update your plan as the environment changes.
  7. Check your backups: make sure they happen at regular intervals and that the data they contain is still good.
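
The backup check in the last point can be partly automated. The following is a minimal Python sketch, not a definitive implementation: it assumes each backup is a single file, that a SHA-256 checksum was recorded when the backup was written, and that backups should be no older than a chosen number of hours. The names `sha256_of` and `check_backup` and the 24-hour default are illustrative assumptions, not part of any particular backup tool.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path, expected_sha256, max_age_hours=24, now=None):
    """Return a list of problems found; an empty list means both checks passed.

    expected_sha256 is assumed to have been recorded when the backup was made.
    max_age_hours is a hypothetical policy value; adjust it to your schedule.
    """
    path = Path(path)
    if not path.is_file():
        return [f"missing: {path}"]

    problems = []
    now = time.time() if now is None else now
    age_hours = (now - path.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"stale: {path} is {age_hours:.1f} hours old")
    if sha256_of(path) != expected_sha256:
        problems.append(f"checksum mismatch: {path}")
    return problems
```

A matching checksum only shows that the file has not changed since it was written. It does not prove that the backup can be restored, so a periodic test restore is still needed.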
358 questions
1 vote, 1 answer

*Lab* restored DC from full backup issues. NETLOGON error 67

I am in a lab environment practicing and learning to recover my primary domain controller in the event that both my DCs are gone and I have a full server backup in place for disaster recovery. In production I have 2 domains, DC1 and DC2, DC1 is the…
1 vote, 1 answer

Do both Active Directory servers need to see each other for replication?

I'm trying to set up a remote DC for DR, and I've chosen to put it in AWS in a VPC with our other servers being backed up. I can restrict that with security groups to only accept traffic from the main office IP while still having a public IP…
icrf
1 vote, 0 answers

Need help recovering CentOS system

My CentOS 6.5 system was getting errors about the hard disk, so I used partclone to clone the root filesystem. Just 2 days later the hard disk died and the system would no longer boot. So I replaced the hard drive and using a CentOS Live CD I created…
Aditya K
1 vote, 0 answers

How do I force an Exchange database to become "active" in "Active Manager"?

How do I force a passive node to become "active" when the active manager says it needs a full sync? Background: We had a catastrophic failure where all that remains is a single edb file. No backups. No log files. The database that remains is on…
makerofthings7
1 vote, 3 answers

Is the Windows event log the only place to perform a post-mortem after a server crash?

A couple of days ago one of our web-servers went down in the small hours. It wasn't responding to any remote requests (to be honest I don't know if it would have responded if a USB keyboard and monitor were plugged into it) and an engineer at the…
Dan
1 vote, 1 answer

Recovering Linux soft raid 5, disk stays as spare

I had a disk failure on my CentOS Linux soft RAID 5 array (mdadm). I replaced one of the disks and started to rebuild the array. The next time I checked the status, the rebuild had failed. This is the status right now: [root@localhost ~]# cat…
devha
1 vote, 2 answers

Rebuild home directory after deletion

This is what happened: root@rasp:~# ls -al total 72 drwx------ 8 root root 4096 Jan 22 21:01 . drwxr-xr-x 25 root root 4096 Sep 11 14:27 .. -rw------- 1 root root 8079 Jan 22 19:55 .bash_history -rw-r--r-- 1 root root 570 Jan 31 2010…
Daniel W.
1 vote, 1 answer

How can I plan a DR solution for an SAP Instance running on HP-UX

I'm supporting an environment that runs SAP on HP-UX. The HP Blade running the SAP instance is connected to IBM storage over an FC connection. The SAN infrastructure is duplicated by redundant primary-secondary SAN switch and also primary-secondary SAN…
Prayag Pal
1 vote, 2 answers

Best DR option for two web servers located in different sites

We're looking to improve the disaster recovery and failover capabilities of our web server and WebSphere application server. We have 2 sites in the UK [HQ and call centre] and want to host a DR web server and application server at the call centre site.…
scottyab
1 vote, 4 answers

Exchange data recovery

I'm trying to do something a little bit obscure, if not to other people at least to me. My current task: Sneak into the (possibly corrupted) data of a Windows 2003 server hard drive, and extract the Exchange data from it, for one or more accounts. I…
1 vote, 4 answers

lsass.exe error, Windows cannot boot

This is apocalypse. The server threw me an "lsass.exe" error this morning, saying that it cannot boot, with the following error. LSASS.EXE - System Error, security accounts manager initialization failed because of the following error: Directory…
Olivier Tremblay
1 vote, 3 answers

Replicating databases using Dell EqualLogic

Can the Dell EqualLogic 6100/4100 replicate databases like MySQL, MS SQL 2012, and Oracle 11g? I would like to set up my web applications and their databases in VMware 5. They would run off the EqualLogic and be synchronized with another EqualLogic…
1 vote, 3 answers

Small office backups, small NAS (RAID) vs single disk

Assume a small office situation, 5-10 workstations (PCs), and a Windows server hosting a network share. The Windows server is in a 3+1 RAID 5. I am debating with myself over a backup solution and I find myself thinking about it way more than I…
1 vote, 2 answers

Can I say "NO" to "Clone multiply-claimed blocks" in fsck?

I need to know if I can still mount the volume when I tell fsck "no" to that question. I don't need to repair all the files in the volume, I just need to recover one single file for this event. The volume has some big files that leave fsck stuck for…
Freaktor
1 vote, 3 answers

Best Disaster Recovery Setup for Domain Controller with Additional Services

I'm setting up some new Windows 2012 servers to replace old ones currently on 2003. One of my concerns is to try and have a suitable DR plan to get them back up and running if we have a major failure. I plan to have the following: Server 1: Domain…