Questions tagged [disaster-recovery]

Disaster recovery and preparedness is an unfortunate aspect of systems administration. This tag should be used for help with planning, implementation and best practices related to recovering from a catastrophic event on a server or in a datacenter environment.

Recovering from an unplanned, catastrophic outage is a painful process whether you are managing a single server or an entire datacenter. Roof leaks, broken water lines, power outages and any number of other events can take what was a great day and turn it into a living nightmare when you are responsible for keeping available the systems that others rely on.

The key to recovering from any disaster is preparedness. Knowing the steps required to bring the network and systems back online is critical. Before you can properly prepare for a disaster, you need to understand the risks, bottlenecks and other critical components of the overall system, e.g. who controls the power, internet, etc. at your site. Knowing which aspects of disaster recovery are within your control is an important part of planning; if no one on staff can fix the power, HVAC, etc., make sure that the contact info for someone who can is written down somewhere. Having this information available before a disaster occurs will help to keep everyone calm, cool and on-task when something actually does happen.

Once the risks are assessed and a plan is created, print out physical copies, email it, and make sure everyone with admin-level access to the systems or datacenter has read it and is familiar with it. The best plan in the world is worthless if it is stored only on a system that is down and cannot be easily restored without following the plan. After everyone is familiar with the plan, practice it when possible; a full rehearsal may not be realistic in many situations, but where you can, take advantage of planned downtime or natural outages to go through the recovery plan and refine it.

In summary, when a disaster happens:

  1. Don't Panic! Panic turns a debacle into a catastrophe every time.
  2. Plan ahead, understand the risks, and know what is within your control.
  3. Follow the plan but be flexible; a recovery plan is more of a jazz tune than a military march.
  4. Stay calm and organized: use checklists and keep notes.
  5. If you are working in a team or group, communicate and collaborate.
  6. Be vigilant and update your plan as the environment changes.
  7. Check your backups: make sure they happen at regular intervals and that the data they contain is still good.
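
As a rough illustration of the last item in the list above, here is a minimal Python sketch that checks whether a backup file is recent enough and still matches a checksum recorded when it was written. The helper names `sha256_of` and `check_backup`, the 24-hour default, and the idea of keeping a recorded SHA-256 value next to the backup are all assumptions made for illustration, not part of any particular backup tool. A matching checksum only shows that the file has not changed since it was written; a test restore is still the real proof that a backup is good.

```python
import hashlib
import os
import time


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path, expected_sha256, max_age_hours=24, now=None):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: the checks and the default age limit are assumptions.
    """
    if not os.path.isfile(path):
        return ["backup file is missing: " + path]

    problems = []
    now = time.time() if now is None else now

    # Age check: the backup should have been written recently.
    age_hours = (now - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        problems.append("backup is %.1f hours old" % age_hours)

    # Size check: an empty file is never a usable backup.
    if os.path.getsize(path) == 0:
        problems.append("backup file is empty")

    # Integrity check: compare against the checksum recorded at backup time.
    if sha256_of(path) != expected_sha256:
        problems.append("checksum does not match the recorded value")

    return problems
```

A scheduled job could call `check_backup` for each backup file and alert someone whenever the returned list is not empty.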
358 questions
2 votes, 4 answers

Complete backup / restore solution for Windows SharePoint Services 3.0

Let me start by saying that if I've missed out on some very basic papers, KBs or anything, feel free to link me in the right direction. I've checked some threads here and haven't found the answer to the questions I have. I've created a simple script…
2 votes, 3 answers

How to repair a damaged transaction log file for Exchange 2003

Yesterday we had a power failure and the UPS did not work (it has worked perfectly before). Everything seemed to be OK when I started all the servers again, except for the mail: when I try to mount the store I get the following message: “The database…
Markus Larsson
2 votes, 1 answer

ESXi host backup including extra VIBs

I have a licensed ESXi host with a couple of extra VIBs (Chelsio, NVIDIA, Dell) installed. I also have vCenter Standard. More often than I'd like, I discover that I have to reinstall the host, e.g. VMware now deprecating SD cards. I'm installing the VIBs…
Chris
2 votes, 3 answers

Need to recover RAID 6 array

By definition RAID 6 is an array of independent hard drives equipped with two independent and distributed checksum schemes. Does that mean I can recover each of the hard drives one by one using normal (not RAID) recovery software? If not, is there…
2 votes, 1 answer

Systemd and Disaster-Recovery Stand-By systems

We're using systemd to run various services in production. (Duh...) We're building out a matching "disaster-recovery" site, which will have the same application installed -- with the same systemd-units to bring up its various components in case of a…
Mikhail T.
2 votes, 2 answers

Disaster Recovery for SQL 2008 with only the log file

When you read this, you just might cry... We wrote a SharePoint intranet app for one of our clients that keeps track of files and manages them through various levels of approval. Several files containing important data were accidentally deleted…
user31530
1 vote, 0 answers

Zerto Failover - Win 2012/16 Servers Get Stuck on "Getting Devices Ready"

We are using Zerto to replicate our VMs to an offsite datacenter. During our failover tests the VMs have a tendency to "hang" from 15 minutes to several hours on "Getting Devices Ready." I have read the other posts where the VMs never booted up,…
1 vote, 2 answers

BTRFS unmountable after cold reboot (total_rw_bytes is twice too big)

One of my users in a research environment triggered an out-of-memory condition on a server which mounts a 52TB btrfs partition. I had to power cycle the server. After the reboot my btrfs partition cannot be mounted in read-write mode. mount /mnt/storage/ mount:…
Met
1 vote, 1 answer

Azure VM Site Replication

I'm looking to set up site replication for our Azure VMs. My question is this: how often are the VMs backed up/replicated to the DR site? Is this replication automated or would I need to kick off this process manually? Also, is there a…
jrd1989
1 vote, 1 answer

btrfs recover: reliable or last-straw?

The chunk tree of a btrfs filesystem got corrupted and I could not recreate it after an hour-long reconstruction. As a last resort (besides restoring from backup) I could use btrfs restore -S -x -m -v to get all the files back. Does anyone know whether the…
Reiner Rottmann
1 vote, 0 answers

Recovering VCS 2 node cluster in DR?

We are looking for a solution to recover the 2 node VCS Cluster at DR location for a DR Test. Prod: Hardware is Dell with 24 cores 64 GB RAM with Suse 11.4 Linux over SAN Boot & 4 NIC's of 1 Gig. DR: HP Hardware with 48 cores 128 GB RAM with Suse…
1 vote, 0 answers

Tried to recover an ext4 superblock to a Ceph disk

We have a server with Proxmox installed with Debian; it's a dedicated server (sda). The VMs are being backed up every night on a Ceph disk (sdb). After a reboot of the dedicated server, the ext4 drive from the dedicated server had a broken…
Roy Zon
1 vote, 1 answer

Can I restore a SQL Server instance from file backups only?

Our Windows 2016 server failed to restart after a Windows update yesterday. To avoid whatever issue the update caused, I'm in the process of rebuilding it clean. So I've reinstalled Windows Server 2016, and SQL Server 2016…
1 vote, 2 answers

Rebuilding RAID1 in Ubuntu

I had my second HD in my RAID1 come up with bad sectors. So I got another drive and pulled out the bad sector drive and put the new drive in. With the original working RAID1 drive in the computer it failed to boot. I manually copied everything…
John Utech
1 vote, 2 answers

What would cause mysqlcheck to incorrectly report a table as undamaged?

We are administering a MySQL server for one of our customers that has >100 databases with about 50 tables each, many of them InnoDB tables. The server crashed and I'm trying to find the culprit. When restarting with innodb_force_recovery = 2, I can…
user2845840