How to diagnose a failed drive on a server with a RAID storage controller?

Question

I have a problem with one of my Dell PowerEdge R210 servers. The machine runs CentOS 6, and today the system logs started reporting that the hard drive is failing.

Jan  6 03:20:12 centos6 kernel: LSI Debug log info 31080000 for channel 0 id 0
Jan  6 03:20:12 centos6 kernel: sd 0:1:0:0: [sda] Unhandled sense code
Jan  6 03:20:12 centos6 kernel: sd 0:1:0:0: [sda] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Jan  6 03:20:12 centos6 kernel: sd 0:1:0:0: [sda] Sense Key : Medium Error [current]
Jan  6 03:20:12 centos6 kernel: Info fld=0x21a9055
Jan  6 03:20:12 centos6 kernel: sd 0:1:0:0: [sda] Add. Sense: Unrecovered read error
Jan  6 03:20:12 centos6 kernel: sd 0:1:0:0: [sda] CDB: Read(10): 28 00 02 1a 90 20 00 00 38 00
Jan  6 03:22:17 centos6 kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Jan  6 03:22:17 centos6 kernel: LSI Debug log info 31080000 for channel 0 id 0
Jan  6 03:22:17 centos6 kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Jan  6 03:22:17 centos6 kernel: LSI Debug log info 31080000 for channel 0 id 0
Jan  6 03:22:17 centos6 kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Jan  6 03:22:17 centos6 kernel: LSI Debug log info 31080000 for channel 0 id 0
Jan  6 03:22:17 centos6 kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Jan  6 03:22:17 centos6 kernel: LSI Debug log info 31080000 for channel 0 id 0
Jan  6 03:22:17 centos6 kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Jan  6 03:22:17 centos6 kernel: LSI Debug log info 31080000 for channel 0 id 0
Jan  6 03:22:17 centos6 kernel: sd 0:1:0:0: [sda] Unhandled sense code
Jan  6 03:22:17 centos6 kernel: sd 0:1:0:0: [sda] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Jan  6 03:22:17 centos6 kernel: sd 0:1:0:0: [sda] Sense Key : Medium Error [current]
Jan  6 03:22:17 centos6 kernel: Info fld=0x21a7d89
Jan  6 03:22:17 centos6 kernel: sd 0:1:0:0: [sda] Add. Sense: Unrecovered read error
Jan  6 03:22:17 centos6 kernel: sd 0:1:0:0: [sda] CDB: Read(10): 28 00 02 1a 7d 80 00 00 18 00
Jan  6 03:22:19 centos6 kernel: sd 0:1:0:0: [sda] Unhandled sense code
Jan  6 03:22:19 centos6 kernel: sd 0:1:0:0: [sda] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Jan  6 03:22:19 centos6 kernel: sd 0:1:0:0: [sda] Sense Key : Medium Error [current]
Jan  6 03:22:19 centos6 kernel: Info fld=0x21a7dc0
Jan  6 03:22:19 centos6 kernel: sd 0:1:0:0: [sda] Add. Sense: Unrecovered read error
Jan  6 03:22:19 centos6 kernel: sd 0:1:0:0: [sda] CDB: Read(10): 28 00 02 1a 7d c0 00 00 80 00
Jan  6 03:28:05 centos6 kernel: sd 0:1:0:0: [sda] Unhandled sense code
Jan  6 03:28:05 centos6 kernel: sd 0:1:0:0: [sda] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Jan  6 03:28:05 centos6 kernel: sd 0:1:0:0: [sda] Sense Key : Medium Error [current]
Jan  6 03:28:05 centos6 kernel: Info fld=0x21a7d88
Jan  6 03:28:05 centos6 kernel: sd 0:1:0:0: [sda] Add. Sense: Unrecovered read error
Jan  6 03:28:05 centos6 kernel: sd 0:1:0:0: [sda] CDB: Read(10): 28 00 02 1a 7d 88 00 00 08 00
Jan  6 03:28:09 centos6 kernel: sd 0:1:0:0: [sda] Unhandled sense code
Jan  6 03:28:09 centos6 kernel: sd 0:1:0:0: [sda] Result: hostbyte=invalid driverbyte=DRIVER_SENSE

I assume that this machine has a RAID controller, but I don't know what type of RAID is configured (if any).

Output from lspci:

01:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

So this is my question: is there a way to diagnose this problem from the Linux command line, without restarting the machine? From the operating system I see only the logical drive, not the physical hard drives behind the RAID. That is normally a good thing, but now I want to check whether there is a RAID at all, which hard drives are members of it, and which hard drive is failing.

EDIT 1: At the moment I only have SSH access to this machine, which is why I want to know whether it is possible to diagnose this problem over SSH.

I am voting to close because, in a professional capacity, you do not run a RAID controller without installing the manufacturer's tools, or on an operating system that is not supported. Not sure others will agree; I also see this as an edge case. No restart? Sorry. The way I read it, you have already lost data and have a corrupt file system. That is not the moment to care about a restart. It is the moment to take a backup and start caring about that. And hope that the error is in a part of the disk not used by your data (which may well be the case). — TomTom, Jan 07 '14 at 10:45
The problem is not that I don't want to restart it, but that I have no access to this server other than SSH. That is why I want to diagnose it that way while waiting to get access to KVM over IP. — B14D3, Jan 07 '14 at 10:50
I think that won't help a lot. My advice, though, is to start with a file system scan. I really do not like the unrecoverable error here. This could be a MegaRAID controller, by the way; that is what Google told me. There is a MegaCTL package available for the command line. Find out the manufacturer, install the tools. — TomTom, Jan 07 '14 at 10:59
Let's see what kind of disk or RAID it is. Please include the complete output of: smartctl -a /dev/sda — Bartłomiej Zarzecki, Jan 07 '14 at 11:40
@BartłomiejZarzecki smartctl doesn't work with virtual drives — B14D3, Jan 07 '14 at 13:35

score 3 · Answer 1 · answered Jan 09 '14 at 18:36

If you're unwilling to restart your system to install the manufacturer's tools, you're basically going to sit here being stubborn until the machine completely dies.
At that point it doesn't matter what you want. The machine will be down, probably for good. You won't have to worry about restarting because you'll have to do so as part of replacing the hard drives and restoring from your backups. (You DO have backups, right?)

Lecture Over.

If you don't want to install the manufacturer's diagnostic tools, your sole remaining option is to physically walk up to the machine and look for the drive with the blinking red (or yellow) "failure" light. Replace that one.
This of course presumes RAID-1, RAID-5, RAID-6, or some other configuration that lets you replace a single failed drive (and that you only have a single failed drive). If you are not in such a configuration, or more drives have failed than your system's fault tolerance allows, you're back to "replace all the drives and restore from backup".

Lacking backups you're stuck with "MAKE BACKUPS, then if you didn't get everything you need call a data recovery company and try to salvage what you can".

score 1 · Answer 2 · answered Oct 31 '14 at 09:29

The disk has medium errors on it, which means there is data that is unreadable. The LSI log info (0x31080000, "SATA NCQ Fail All Commands After Error") tells you that the other I/Os queued behind the failing read were failed as well, which is simply how SATA NCQ does error recovery. That's not an issue by itself.
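To see which block each error refers to, the CDB and "Info fld" lines in the log can be decoded by hand. The following is a minimal sketch, assuming the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and assuming the Info field holds the LBA of the failing block, as it normally does for a Medium Error. The function names are made up for illustration.

```python
def decode_read10(cdb_hex):
    """Return (start_lba, block_count) for a READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    block_count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return start_lba, block_count


def failing_offset(cdb_hex, info_fld):
    """Offset of the failing block inside the request, or None if outside it."""
    start_lba, block_count = decode_read10(cdb_hex)
    if start_lba <= info_fld < start_lba + block_count:
        return info_fld - start_lba
    return None
```

For the first error in the question, `decode_read10("28 00 02 1a 90 20 00 00 38 00")` gives a 56-block read starting at LBA 0x21a9020, and the reported Info field 0x21a9055 falls inside that range. The later errors all sit around LBA 0x21a7d88-0x21a7dc0, so the damage appears to be confined to a small region so far.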

You have a bad disk and you have already lost data. If you have RAID, then it failed to do its job too. Most likely, though, you do not have RAID at all. To find out, look at the output of lsscsi, which shows the vendor and model of the /dev/sda device. If it shows a hard disk vendor (WD, Hitachi, Seagate), you have a lone HDD there; if it shows LSI, you have a RAID device.
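If lsscsi is not installed, the same vendor and model strings can be read from sysfs over SSH. This is only a rough sketch: the `vendor` and `model` files under `/sys/block/<dev>/device/` are standard for SCSI disks, but the marker strings in `RAID_HINTS` are my own guess at what a controller-provided volume might report and are not taken from any vendor documentation. Treat the result as a hint, not a verdict.

```python
from pathlib import Path

# Assumed markers for a controller-provided volume (hypothetical heuristic).
RAID_HINTS = ("LSI", "VIRTUAL", "LOGICAL VOLUME")


def read_identity(dev="sda", sysfs="/sys/block"):
    """Return the (vendor, model) strings the kernel reports for a SCSI disk."""
    base = Path(sysfs) / dev / "device"
    return tuple((base / name).read_text().strip() for name in ("vendor", "model"))


def looks_like_raid_volume(vendor, model):
    """Heuristic: True if the identity strings contain one of the assumed markers."""
    text = f"{vendor} {model}".upper()
    return any(hint in text for hint in RAID_HINTS)
```

Calling `looks_like_raid_volume(*read_identity("sda"))` on the server gives a first indication. A plain SATA disk usually reports the vendor as `ATA` with the drive's own model string.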

Either way, you have already lost data: even if it is a RAID device, it failed to recover from the underlying Medium Error and returned a Medium Error of its own.

What to do after this?

You need to find out which files you lost: try to read them one by one (find, xargs and cat are a good combination for this) and see which files cannot be read. You'll need to restore those from backup.

To recover the sectors, just write to them again; that clears the current medium error. You can simply delete the affected files or overwrite them, and the filesystem will rewrite the sectors in its own time.

To know whether the HDD is still worth using, you need to see whether the issue repeats or spreads. You can use smartctl for that: look mostly at the number of reallocated sectors, and if it grows by more than one a month, you want to replace the disk.
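Tracking that counter over time is easy to script. The sketch below only parses text you have already captured from `smartctl -A` for the physical disk; it assumes the usual ATA attribute table, where attribute 5 is `Reallocated_Sector_Ct` and the raw value is the tenth column. How to reach the physical disk behind the controller is a separate problem that this does not solve, and the function names are made up for illustration.

```python
def reallocated_sectors(smartctl_output):
    """Return the raw Reallocated_Sector_Ct value, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        if (len(fields) >= 10 and fields[0] == "5"
                and fields[1] == "Reallocated_Sector_Ct"):
            return int(fields[9])  # tenth column is RAW_VALUE
    return None


def growth(previous_output, current_output):
    """Difference in reallocated sectors between two saved smartctl reports."""
    before = reallocated_sectors(previous_output)
    after = reallocated_sectors(current_output)
    if before is None or after is None:
        return None
    return after - before
```

Save one report now and another a few weeks later, then compare them with `growth(old_text, new_text)`. A value that keeps climbing is the signal to replace the disk.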

If in doubt and you care about the data, replace the disk. A disk with medium errors is more likely to be bad than one without any.
