
I've got an Ubuntu system with a bunch of hard disks in it acting as my home router, DHCP server, file server, etc. Twice in the past 24 hours it has suddenly decided to set the root filesystem to read-only. I think there's a hardware failure on one of the drives. I've ordered a new drive just to be safe.

Jul  8 07:40:54 monolith kernel: [   42.851001] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul  8 07:40:54 monolith kernel: [   42.851047] ata3.00: BMDMA stat 0x24
Jul  8 07:40:54 monolith kernel: [   42.851089] ata3.00: cmd c8/00:08:67:6a:00/00:00:00:00:00/e0 tag 0 dma 4096 in
Jul  8 07:40:54 monolith kernel: [   42.851134] ata3.00: status: { DRDY ERR }
Jul  8 07:40:54 monolith kernel: [   42.851173] ata3.00: error: { UNC }

My main questions are: do you think this is indicating incipient hard drive failure? I looked at smartctl but I'm not really sure what I'm looking for.
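
For reference, here's roughly what I've been running (the drive letter is a placeholder until I know which disk ata3 is); from what I've read, the sector-reallocation counters are the things to watch, but I'm not certain:

    # /dev/sdX is a placeholder: substitute the suspect drive
    # -H prints the overall health verdict, -A the full attribute table
    sudo smartctl -H -A /dev/sdX

    # Attributes I understand to matter most for failing media:
    #   Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable
    # (non-zero raw values are supposedly a bad sign)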

Also, is there a way to figure out which /dev/sd* ata3 corresponds to?
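
The best I've come up with so far is grepping sysfs, though I'm not sure how stable this layout is across kernel versions:

    # The ataN port number appears in each block device's sysfs path,
    # so resolve each device symlink and grep for the port in question
    for d in /sys/block/sd?; do
        printf '%s -> %s\n' "$d" "$(readlink -f "$d/device")"
    done | grep ata3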

/proc/mdstat says:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sda3[0] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
      4877654400 blocks level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]

unused devices: <none>

Which I think looks good: [6/6] [UUUUUU] should mean all six members are up.

What would you do if you were in my shoes, staring down a possible RAID failure?

jsd
    Do you have backups? If not, explain in 50 words or less why you don't care about your data... – voretaq7 Jul 08 '13 at 16:51
  • Home systems are off topic on serverfault. That error could be a dying drive, loose cable, or failing controller on the motherboard. Make backups regularly. – Grant Jul 08 '13 at 16:53
  • RAID 5 is quick and efficient for temporary files or swap (even for hibernation, given a recent Linux kernel and a proper initrd), but it is not recommended for storing long-term data. – F. Hauri - Give Up GitHub Jul 08 '13 at 17:22

1 Answer


When using RAID 5, it's important to have a spare disk ready to take over the moment one drive fails. But:

In a system where a number of disks have been running together for a while, when the first disk crashes, there is a good chance that a second will fail within a very short interval!

and

When a drive has failed in a RAID 5 array, the remaining drives have to work much harder to reconstruct the missing data. So while the array rebuilds, the whole set of disks is under heavy load, and the probability of a second crash during that window is high!
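
You can watch the rebuild's progress, and see how long that window of risk lasts, straight from the kernel:

    # Live view of the rebuild, with percentage done and estimated finish
    watch -n 5 cat /proc/mdstat

    # More detail on the array state and recovery progress
    mdadm --detail /dev/md0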

So make your backups while you still can!

And do it even before installing the new disk!
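
Once the data is safe, adding the new drive as a hot spare is a one-liner (assuming /dev/sdg1 is the partition you create on the replacement disk):

    # /dev/sdg1 is a placeholder: the partition on your new drive.
    # md will rebuild onto the spare automatically if a member fails.
    mdadm /dev/md0 --add /dev/sdg1

    # The spare shows up with an (S) flag in /proc/mdstat
    cat /proc/mdstat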

Keep in mind

Even if it tests fine in degraded mode, under real production load a degraded RAID 5 array slows the whole system down significantly.