
I have two hard drives set up as a RAID 1 array on my server (Linux, software RAID using mdadm), and one of them just left me this "present" in syslog:

Nov 23 02:05:29 h2 kernel: [7305215.338153] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:29 h2 kernel: [7305215.338178] ata1.00: irq_stat 0x40000008
Nov 23 02:05:29 h2 kernel: [7305215.338197] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:29 h2 kernel: [7305215.338220] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:29 h2 kernel: [7305215.338221]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:29 h2 kernel: [7305215.338287] ata1.00: status: { DRDY ERR }
Nov 23 02:05:29 h2 kernel: [7305215.338305] ata1.00: error: { UNC }
Nov 23 02:05:29 h2 kernel: [7305215.358901] ata1.00: configured for UDMA/133
Nov 23 02:05:32 h2 kernel: [7305218.269054] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:32 h2 kernel: [7305218.269081] ata1.00: irq_stat 0x40000008
Nov 23 02:05:32 h2 kernel: [7305218.269101] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:32 h2 kernel: [7305218.269125] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:32 h2 kernel: [7305218.269126]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:32 h2 kernel: [7305218.269196] ata1.00: status: { DRDY ERR }
Nov 23 02:05:32 h2 kernel: [7305218.269215] ata1.00: error: { UNC }
Nov 23 02:05:32 h2 kernel: [7305218.341565] ata1.00: configured for UDMA/133
Nov 23 02:05:35 h2 kernel: [7305221.193342] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:35 h2 kernel: [7305221.193368] ata1.00: irq_stat 0x40000008
Nov 23 02:05:35 h2 kernel: [7305221.193386] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:35 h2 kernel: [7305221.193408] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:35 h2 kernel: [7305221.193409]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:35 h2 kernel: [7305221.193474] ata1.00: status: { DRDY ERR }
Nov 23 02:05:35 h2 kernel: [7305221.193491] ata1.00: error: { UNC }
Nov 23 02:05:35 h2 kernel: [7305221.388404] ata1.00: configured for UDMA/133
Nov 23 02:05:38 h2 kernel: [7305224.426316] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:38 h2 kernel: [7305224.426343] ata1.00: irq_stat 0x40000008
Nov 23 02:05:38 h2 kernel: [7305224.426363] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:38 h2 kernel: [7305224.426387] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:38 h2 kernel: [7305224.426388]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:38 h2 kernel: [7305224.426459] ata1.00: status: { DRDY ERR }
Nov 23 02:05:38 h2 kernel: [7305224.426478] ata1.00: error: { UNC }
Nov 23 02:05:38 h2 kernel: [7305224.498133] ata1.00: configured for UDMA/133
Nov 23 02:05:41 h2 kernel: [7305227.400583] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:41 h2 kernel: [7305227.400608] ata1.00: irq_stat 0x40000008
Nov 23 02:05:41 h2 kernel: [7305227.400627] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:41 h2 kernel: [7305227.400649] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:41 h2 kernel: [7305227.400650]          res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:41 h2 kernel: [7305227.400716] ata1.00: status: { DRDY ERR }
Nov 23 02:05:41 h2 kernel: [7305227.400734] ata1.00: error: { UNC }
Nov 23 02:05:41 h2 kernel: [7305227.472432] ata1.00: configured for UDMA/133

From what I've read so far, I'm not sure whether read errors alone mean a hard drive is dying on me (there are no write errors so far). Past drive failures I've seen always logged errors about failing to write to specific sectors. Not this time.

Should I be replacing the drive? Could something else be causing the problem?

I've scheduled a smartctl -t long test that will finish in a couple of hours. I hope this will give me some more info.
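
For the record, here's roughly what that looks like. This is only a sketch -- /dev/sda is a placeholder for the drive behind ata1, so adjust it for your setup:

    # Kick off the long offline self-test (runs inside the drive's firmware)
    smartctl -t long /dev/sda

    # A couple of hours later: self-test results plus the telling attributes
    smartctl -l selftest /dev/sda
    smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

From what I've read, Current_Pending_Sector is the one to watch: pending sectors are exactly the "can't read it, hasn't been rewritten yet" case these UNC errors describe.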


UPDATE: Something like a miracle happened. Details below:

I was backing up some files off that machine, preparing to replace the faulty drive. Then, as I was copying those huge files, I got this logcheck email:

Security Events for kernel
=-=-=-=-=-=-=-=-=-=-=-=-=-
Nov 23 17:16:24 h2 kernel: [7359837.963597] end_request: I/O error, dev sdb, sector 1202093816
Nov 23 17:16:41 h2 kernel: [7359855.196334] end_request: I/O error, dev sdb, sector 1202093816

System Events
=-=-=-=-=-=-=
Nov 23 17:14:06 h2 kernel: [7359700.193114] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 17:14:06 h2 kernel: [7359700.193139] ata2.00: irq_stat 0x40000008
Nov 23 17:14:06 h2 kernel: [7359700.193158] ata2.00: failed command: READ FPDMA QUEUED
Nov 23 17:14:06 h2 kernel: [7359700.193180] ata2.00: cmd 60/08:00:58:03:aa/00:00:47:00:00/40 tag 0 ncq 4096 in
Nov 23 17:14:06 h2 kernel: [7359700.193181]          res 41/40:08:58:03:aa/00:00:47:00:00/00 Emask 0x409 (media error) <F>
Nov 23 17:14:06 h2 kernel: [7359700.193247] ata2.00: status: { DRDY ERR }
Nov 23 17:14:06 h2 kernel: [7359700.193265] ata2.00: error: { UNC }
Nov 23 17:14:06 h2 kernel: [7359700.194458] ata2.00: configured for UDMA/133

Oops! My hair, if I had any on my shaved head, would have stood up. So there are real effing bad sectors on the second drive too. Now what? With two faulty drives, what do I do?
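
To see which partition -- and therefore which md array -- that failing sector lived on, one can compare the LBA from the log against the partition table. A sketch, with the sector number taken from the messages above:

    # Print partition boundaries in sectors and see where LBA 1202093816 falls
    parted /dev/sdb unit s print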

I gave it some thought and realized that I had:

  • one drive that I suspected was faulty, and
  • another that I was 100% sure was faulty, given the bad-sector complaints in the log.

So I replaced the second one, not the one I originally posted the question about. I had several partitions, each on its own RAID 1 array, and I was hoping I'd be able to resync at least the root and boot ones so that I wouldn't have to reinstall everything on the server. I'd probably still have to restore the huge data partition from backup, but it would save me some work. The rough mdadm sequence is sketched below.
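
Device and array names here are examples -- my real layout had more arrays, so match them to yours:

    # Fail and remove the dying disk's members from each array
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
    mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2

    # After the physical swap: clone the partition table from the healthy
    # drive (sfdisk works for MBR disks; use sgdisk -R for GPT), then re-add
    sfdisk -d /dev/sda | sfdisk /dev/sdb
    mdadm /dev/md0 --add /dev/sdb1
    mdadm /dev/md1 --add /dev/sdb2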

Replaced the drive, started the resyncs. Root and boot partitions (about 50GB) resynced really fast. No errors. I'm a happy camper!

Just for kicks, I tried resyncing the huge data partition too -- it's about 2 TB with 500 GB of data on it. I started the resync and watched it for a while. It looked like it would take forever, so I brought the server back online and let users get at their stuff while the resync ran in the background. And, what do you know, about 18 hours later the resync finished with no errors. The server is fully alive now.
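
Keeping an eye on the rebuild is just a matter of polling /proc/mdstat; md reports progress, speed, and an ETA there. (/dev/md2 below is a placeholder for the data array.)

    # Refresh the rebuild progress every minute
    watch -n 60 cat /proc/mdstat

    # Or get a detailed one-off view of a single array
    mdadm --detail /dev/md2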

I wonder if I should be replacing the original drive now. I'm sure the server god of hard drives is laughing his butt off at me.

Hristo Deshev

2 Answers


It's not about to die.. It's already dead.

Replace it ASAP, and restore from backups if you lose any data.

Tom O'Connor
  • Isn't the point of RAID 1 to survive the loss of all but one of the disks in the array without loss of data? Let's just hope that the only remaining disk doesn't break down under the load of rebuilding the array once the dead drive has been replaced. – user Nov 23 '12 at 15:34
  • Oh, in theory, that's the plan... But it depends on the grade of the drives.. If they're WD Blacks it should be fine.. If they're Greens or Blues.. Well.. YMMV, but I'd be checking backup integrity.. – Tom O'Connor Nov 23 '12 at 15:44
  • Semi-related, shouldn't a scrub catch any data integrity issues regardless of drive make and model? (Hopefully before one of the drives in the array fails...) Of course, that depends on one scrubbing *before* things go awry. Scrubbing a degraded array would probably make things worse, and certainly not better. (A minimal scrub sketch follows after these comments.) – user Nov 23 '12 at 15:49
  • Again.. In theory. It might not catch it if the drive chips have gone nuts. I'd be less scared (somewhat) if it were hardware RAID. The make and model thing is really just.. what's the word?.. anecdotal evidence.. – Tom O'Connor Nov 23 '12 at 15:51
  • But there is some truth to enterprise drives (the SAS-controlled ones, and the higher-end Black ones) having higher MTBF and being able to recover from more read errors before going entirely loopy. – Tom O'Connor Nov 23 '12 at 15:52
  • http://www.standalone-sysadmin.com/blog/2012/08/i-come-not-to-praise-raid-5/ for more info on MTBF and UREs – Tom O'Connor Nov 23 '12 at 15:52
  • Tom, thanks a million! I'm scheduling the HDD replacement time with the hosting technicians. – Hristo Deshev Nov 23 '12 at 16:19
  • Well, it turned out the drive was healthy enough to survive a RAID resync. My *other* hard drive on the server died, and I could resync the array off the first one... I've updated the question text with details. – Hristo Deshev Nov 24 '12 at 16:15
  • Yes. You should be replacing the drive that was failed previously and now works. It's trolling you. – Tom O'Connor Nov 24 '12 at 17:57
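
For completeness, the scrub mentioned in the comments above boils down to something like this (md0 is a placeholder array name):

    # Read both mirrors and count inconsistencies (reported in mismatch_cnt)
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt

    # "repair" additionally rewrites unreadable sectors from the good mirror,
    # which nudges the drive to remap them -- likely what the resync above did
    echo repair > /sys/block/md0/md/sync_action

Debian-based systems ship a checkarray cron job with the mdadm package that runs essentially this on a monthly schedule.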

I can't find any reliable source to back up my opinion, but I really think this is not hardware damage. It's more of a data-retrieval problem.

If new data is written to the exact location where the read operation failed, that location should then become readable again, because the drive will either rewrite the sector in place or remap it to a spare.

So, as a final note: your current data might not be recoverable from that drive, but since you have a RAID array you can still get your data back from the other drive and make a backup, then format the faulty drive and resynchronize your RAID array.

This problem might be caused by electromagnetic fields altering the contents of the hard drive.

  • I'm sorry.. but after 10 years of experience with hard disk failures on Linux, this is pretty much always a dead drive. It's either the disk platter failing, or the drive electronics. – Tom O'Connor Nov 23 '12 at 15:45