I made a terrible, stupid mistake. Please advise me on how best to proceed. My configuration is a RAID5 array of 4x4TB drives. On top of it sits LVM with various volumes, including /, swap, everything. The configuration was mostly automatic, generated by the datacenter's installation scripts.
Over time I ran into various performance issues, with some processes constantly reading from and writing to my drives. After reading articles about chunk size, I decided to experiment and decrease the chunk size from 512K to 64K. To change only that, I ran:
mdadm --grow -c 64 --backup-file=/root/somefile.txt /dev/md2
Yes, the backup file should have been placed EXTERNALLY, but I didn't have anything else connected and proceeded despite the risk. The command exited immediately, so I assumed that was it. I ran an ls, which was OK, and then the server stopped responding. The only things still working were some processes like nginx, which ran at nice priority -10, and I had no way to stop them to see what was happening because nothing else worked: webmin kept loading forever, SSH asked for the user and password and then gave no console, and my existing SSH connection blocked on a second ls. I figured that my monitoring server and other processes were eating all my I/O resources and that I couldn't do anything until I killed them all, so I sent CTRL+ALT+DEL to the server from my datacenter console. It didn't work. Finally, I decided that a hard reset would reboot the machine and stop some services that need a manual restart afterwards, which would let me see what was wrong. A big mistake, I assume.
The server did not come back up: /dev/md2 can no longer be found, and everything is on that RAID5 volume.
I have read a lot about mdadm --examine --scan and various recovery strategies, but I really don't want to touch my system again until I have asked some experts. It was my own stupidity that led to this situation.
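To be concrete about what I mean by not touching the system: as far as I understand, reading /proc/mdstat and running mdadm --examine on the member devices only read metadata and write nothing. Below is a minimal sketch that merely prints those commands instead of running them. The member device names (/dev/sda3 and so on) and the helper name inspect_cmds are assumptions for illustration, not my confirmed layout.

```shell
#!/bin/sh
# Print (do NOT run) read-only inspection commands for a list of
# RAID member devices. Device names passed in are placeholders (assumption).
inspect_cmds() {
    # Kernel's current view of md arrays; reading this file changes nothing.
    echo "cat /proc/mdstat"
    for dev in "$@"; do
        # --examine only reads the md superblock of the given member device.
        echo "mdadm --examine $dev"
    done
}

# Hypothetical member partitions of /dev/md2.
inspect_cmds /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```

The output of the --examine commands (in particular the reshape position and chunk size fields) is what I expect would be asked for first, so I can post it if that helps.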
I can't overstate how important it is to recover as much as I can from the almost 11TB volume of unique data. Unfortunately, I only understood later that although mdadm exits quickly, the reshape keeps running in the background and, most importantly, the first part of a --grow is the critical one. The power was interrupted a few minutes, maybe 10, after I started the grow.
Please advise. Thank you!
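In case the "almost 11TB" figure looks odd next to 4x4TB drives: RAID5 keeps one drive's worth of parity, and I am assuming the drives are sized in decimal TB while the tools report binary TiB. A small sketch of that arithmetic (the function name usable_tib is just for illustration):

```shell
#!/bin/sh
# Usable RAID5 capacity in TiB, assuming drive sizes are given in decimal TB.
# $1 = number of drives, $2 = size of each drive in TB
usable_tib() {
    # One drive's worth of space goes to parity, hence (n - 1).
    awk -v n="$1" -v tb="$2" 'BEGIN { printf "%.2f\n", (n - 1) * tb * 1e12 / 2^40 }'
}

usable_tib 4 4   # prints 10.91
```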