
I made a terrible, stupid mistake. Please advise me on how best to proceed. My setup is a RAID 5 of 4×4 TB drives, with LVM on top holding all partitions, including / and swap. The configuration was largely automatic, based on the datacenter's installation scripts.

Over time I developed various performance issues, with some processes constantly reading and writing to my drives, so after reading some articles about chunk size I decided to experiment and decrease the chunk size from 512K to 64K. To change only that, I ran:

mdadm --grow -c 64 --backup-file=/root/somefile.txt /dev/md2

Yes, the backup file should have been placed on an EXTERNAL device, but I had nothing else connected and accepted the risk. The command exited immediately, so I figured that was it. I did an ls, which worked, and then the server stopped responding. The only things still running were processes like nginx that had a nice priority of -10, and I had no way to stop them to see what was happening: Webmin loaded forever, SSH asked for the username and password and then gave no console, and my existing SSH session blocked on a second ls. I assumed my monitoring server and other processes were now eating all my I/O and that I couldn't do anything until I killed them all, so from my datacenter console I sent Ctrl+Alt+Del to the server; it didn't work. Finally I decided a hard reset would reboot the machine and stop some services (which would later need a manual restart), letting me see what was wrong. A big mistake, I assume.

The server didn't reboot: /dev/md2 cannot be found now, and everything is on that RAID 5 volume.

I have read a lot about mdadm --examine / --assemble --scan and various recovery strategies, but I really don't want to touch the system again until I've asked some experts. It was my own stupidity that led to this situation.

I can't overstate how important it is to recover as much as possible of the almost 11 TB of unique data on this volume. I later learned that, although mdadm exits quickly, background processes keep working, and most importantly, the first phase of a --grow is the critical one. The power was interrupted a few minutes, maybe 10, after starting the grow.
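For anyone reading this later: the grow command returning immediately does not mean the reshape is done — it runs in the kernel afterwards, and its progress can be watched before touching anything else. A minimal sketch (the md device name is from this setup):

```shell
# The reshape runs in the background after mdadm --grow returns.
# Its progress shows up as a "reshape = X%" line here:
cat /proc/mdstat

# More detail, including the reshape status and position:
mdadm --detail /dev/md2

# Refresh every 5 seconds until the reshape completes:
watch -n 5 cat /proc/mdstat
```

Had I watched /proc/mdstat instead of assuming the operation had finished, the hard reset mid-reshape could have been avoided.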

Please advise. Thank you!

1 Answer


I figured it out with the help of Phil Turmel from the linux-raid@vger.kernel.org mailing list. In my case,

mdadm -E /dev/sd[a-d]3 (the partitions involved in the volume)

was giving consistent information and showed the reshape stopped at only ~50 MB in, strangely enough.
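For reference, examining the member partitions is read-only and safe, and the point is to compare a few fields across all members before attempting any assembly. A sketch (field names as printed for v1.2 superblocks; values here are illustrative, not from this incident):

```shell
# Read-only: dumps each member's superblock, changes nothing on disk.
mdadm -E /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

# Fields worth comparing across all four members:
#   Events        - should match (or be very close) on every drive
#   Array State   - which members each drive believes are active
#   Chunk Size    - the old chunk size; during a chunk-size reshape a
#                   "New Chunksize" line shows the target value
#   Reshape pos'n - how far the interrupted reshape actually got
```

If the members disagree badly on Events or Array State, that changes the recovery strategy — which is exactly the kind of situation to take to the linux-raid list before acting.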

mdadm -Av --invalid-backup --backup-file=/some/real/empty/file /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

really did the trick: although you tell mdadm that the backup file is invalid, you must still provide one for the operation to continue.

I really recommend using overlay files, as described in https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file — this allowed me to first assemble with an invalid backup, mount the partitions and extract the correct backup file, revert all changes to my drives by throwing away the overlays, and then run
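The overlay technique from that wiki page boils down to putting a copy-on-write device-mapper snapshot in front of each member, so every experimental write lands in a sparse file instead of on the real drive. A rough sketch, assuming member partitions sda3–sdd3 and an /overlay directory with enough free space (adapt names and sizes to your setup):

```shell
# For each RAID member, build a snapshot device that reads from the real
# partition but diverts all writes into a sparse overlay file.
for d in sda3 sdb3 sdc3 sdd3; do
    # Sparse file that absorbs writes; only actually-written blocks use space.
    truncate -s 4T /overlay/$d.ovl
    loop=$(losetup -f --show /overlay/$d.ovl)
    size=$(blockdev --getsz /dev/$d)   # size in 512-byte sectors
    # "P" = persistent snapshot, 8 = COW chunk size in sectors.
    dmsetup create $d-ovl --table "0 $size snapshot /dev/$d $loop P 8"
done
# Now assemble from /dev/mapper/sda3-ovl etc. instead of the raw partitions.
# To revert everything: dmsetup remove the snapshots and delete the files.
```

Because the originals are never written, you can try an assembly, inspect the result, and throw the overlays away to get back to exactly the pre-experiment state.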

mdadm -Av --backup-file=/extracted/valid/backup/file /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

which went smoothly, without issues.

If you no longer have the backup file, you can simply stop after the first command, although some data loss is possible. However, unless you are very unlucky and lose some important filesystem/volume metadata with it, the damage should be quite minimal.

In any case, I really recommend asking for help on linux-raid@vger.kernel.org when in doubt: things are generally recoverable, but wrong moves can really destroy data.

jazzman
  • I'm glad you got your data back; +1 for the excellent answer. However, -1 for the question, from me; if you carry any lessons away from this, they should be (a) that 16TB is too much HDD for a RAID-5, and (b) RAID is not a substitute for backups. It would be nice to know that, starting today, you are instituting a proper backup strategy. – MadHatter May 13 '16 at 07:26