
**Disclaimer:** I only recently became the administrator of this system and discovered that the backups are unusable. The state of the administration software is also terrible.

The system (Ubuntu 14.04) was running two 146 GB 10k SAS drives in RAID 1 (A and B). The drive enclosures are hot-swappable, so the server was, and still is, running throughout this process.

  • Failed drive A was replaced with drive C; a flashing green status confirmed that the array was rebuilding
  • I came back to find C with a solid green status (online), but drive B solid amber (offline/critical failure)

  • However, large patches of the filesystem were clearly never synced, indicated by Input/output errors and by the filesystem dropping to read-only

My goal is to determine the source of the drive B failure and, if it is something small such as an unreadable block error, either to restart the system using drive B or to force a rebuild of the array despite the errors. The main thing is working out how to get the array controller to report the failure mode and to treat the failed drive as good.
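(The controller turned out to be an HP Smart Array; assuming the HPACUCLI utility is installed and the controller sits in slot 0 — both assumptions — the kind of status report I'm after looks roughly like this:)

    # Controller-level health summary
    hpacucli ctrl all show status

    # Full view of arrays, logical drives and physical drives
    hpacucli ctrl all show config

    # Per-drive status; slot=0 is an assumption and may differ
    hpacucli ctrl slot=0 physicaldrive all show status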

I'm only looking to recover a few small config files to make my life easier when reinstalling.

The server is currently up in a limited state, but it definitely won't boot from drive C if restarted, as portions of /bin/ were lost. Surprisingly, it is still serving its function, since it is only used regularly for DHCP and SSH.


1 Answer


I eventually solved this and I actually managed to recover most of the configs.

The filesystem was mounted read-only because Linux had detected the fault and was attempting to prevent any further damage.
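This is easy to confirm from the running system; a quick sketch (/dev/sda1 is an assumed device name for whatever backs the affected filesystem):

    # Kernel messages for the underlying I/O errors and the forced remount
    dmesg | grep -iE 'i/o error|read-only'

    # Any filesystem whose mount options start with "ro" has been remounted read-only
    grep ' ro[ ,]' /proc/mounts

    # The on-error policy lives either in the superblock or in /etc/fstab
    # (errors=remount-ro)
    tune2fs -l /dev/sda1 | grep -i 'errors behavior'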

  1. Reboot the system to a live CD; at the RAID controller prompt, force the system to ignore the [newly] dead drive

  2. Install the HP Array Configuration Utility (HPACUCLI) to inspect the RAID status, then mount the drive and back up the files that I could (~24 h of on-time in total; see the first sketch after this list)

  3. Remove the Live CD and restart, booting into the original OS (which actually worked!)

  4. Run fsck on the original disk (a lot of /home/ data was lost, but that wasn't a problem; see the second sketch after this list)

  5. Replace the newly failed drive and set up a proper backup strategy so this doesn't happen again.
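For step 2, roughly the session involved — the device name, mount point and paths are assumptions and will differ:

    # What the controller thinks of the array and each physical drive
    hpacucli ctrl all show config detail

    # Mount the surviving logical drive read-only so the backup can't make
    # anything worse (/dev/sda1 and the mount point are assumed names)
    mkdir -p /mnt/recovery
    mount -o ro /dev/sda1 /mnt/recovery

    # Pull out whatever config files still read cleanly; rsync reports the
    # files that hit I/O errors and carries on with the rest
    mkdir -p /root/rescued-etc
    rsync -av /mnt/recovery/etc/ /root/rescued-etc/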
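And for step 4, a minimal sketch — fsck must only run against an unmounted (or read-only) filesystem, and /dev/sda1 is again an assumed device name:

    # Force a full check and answer yes to the repair prompts
    fsck.ext4 -fy /dev/sda1

    # Confirm the filesystem is marked clean afterwards
    tune2fs -l /dev/sda1 | grep -i 'filesystem state'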
