
I have a RAID6 array that had been neglected and just had its third drive fail. I want to run ddrescue on the last drive to fail to try to recover the RAID, but I don't know how to identify which drive that was. To make matters worse, I'm using a 3ware RAID card in JBOD mode, so identifying which drive is mapped to which /dev/sdX device is problematic. To complicate things further, when I rebooted with the replacement disks, all the device letters changed ...

All three failed disks are visible to the operating system and are partitioned as "Linux raid autodetect". The filesystem on top was XFS. Is there any way to query a disk to see when it was last written to?

The failure happened long enough ago that there is no record of it in /var/log/messages*.

John P
  • The safest option probably is to get a few more disks and then copy all the disks in parallel using ddrescue. – kasperd Jan 26 '15 at 06:57
  • I'm currently running ddrescue on two of them (I only had 2 free slots in my chassis). Even after copying them, however, I still need to know which one was last to fail, right? Otherwise I'll be rebuilding the array with very old data. – John P Jan 26 '15 at 07:20
  • Perhaps the 3ware card's on-board firmware logging might still have something useful? Not sure what tool is needed to export though, I'm accustomed to LSI. – JimNim Jan 26 '15 at 07:20
  • @JohnP It's been years since the last time I needed to look into that level of detail in Linux software RAID. I recall at the time, there was a counter in the header of each disk, which increased when the RAID was started or stopped and each time the set of disks in the RAID changed. In general the disks with the highest counter value had the most up to date data, though it was possible to get it to misbehave by booting a few times with only outdated disks present and none that were up to date. – kasperd Jan 26 '15 at 07:36
  • @symcbean - not a duplicate. I know which 3 drives failed. I need to know which one failed last. – John P Jan 26 '15 at 15:24
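
Following up on kasperd's comment about the per-disk counter: one way to inspect it is with mdadm --examine, sketched below (this assumes the partitions still carry a Linux software RAID superblock; the partition names are placeholders).

# mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 | egrep 'dev|Events|Update Time'

md stops updating the superblock of a member once it kicks it out, so of the three old disks, the one showing the highest Events count and the latest Update Time should be the last one to have failed.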

1 Answer

This might not work in a lot of cases, but it saved me once.

Assuming all the disks still respond to SMART queries, there is a SMART attribute that might hint at which disk failed last: 9 Power_On_Hours.
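
For example, to read that attribute (the device name is a placeholder; disks sitting behind a 3ware controller may instead need to be addressed through the controller device with smartctl's -d 3ware,N option):

# smartctl -A /dev/sda | grep -i power_on_hours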

The SMART error log might also give useful information:

# smartctl -l error /dev/sda

Some sample output for a failed disk:

Error 47 occurred at disk power-on lifetime: 4600 hours (191 days + 16 hours)

Of course, the best you can do with this is make an informed guess.

Say disk A has 5000 power-on hours and disk B has 7000, and the last reported error on A was at hour 4600 while the last reported error on B was at hour 5000. A's last error is then only 400 powered-on hours in its past, while B's is 2000 hours back, so A was likely the last of the two to fail.
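
A rough sketch of pulling both numbers from each suspect disk in one pass (the device names are examples):

for d in /dev/sdb /dev/sdc /dev/sdd; do
    echo "== $d =="
    smartctl -A "$d" | grep -i power_on_hours
    smartctl -l error "$d" | grep -i 'power-on lifetime' | head -n 1
done

smartctl lists the most recent errors first, so head -n 1 should pick out each disk's latest logged error.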

Either way, I would image all of the disks first, and only then start trying to gather further information or attempt a recovery.
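
For the imaging itself, a sketch of a typical two-pass ddrescue run (the source device, image path and map file name are placeholders): a first pass that copies the easy areas quickly and skips the scraping phase, then a second pass that retries the bad areas with direct disc access.

# ddrescue -n /dev/sdb /mnt/rescue/sdb.img /mnt/rescue/sdb.map
# ddrescue -d -r3 /dev/sdb /mnt/rescue/sdb.img /mnt/rescue/sdb.map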

GnP