2

For several years now we run a software RAID1 on a old Gentoo box with a custom Linux 2.6.31. The RAID consists of 2 harddisk with 4 partitions each. In the last years it happend about 3-4 times that a disk was thrown out of the array. But each time badblocks didn't report an error and i was able to reactivate the disk like this

mdadm /dev/md3 -r /dev/sda3
mdadm /dev/md3 -a /dev/sda3

This time the situation is different: mdadm reported 2 faulty partitions over the past 24 h, both on the same disk sda. Again I ran badblocks without success: It reported 0 bad blocks. If I try to add the faulty disk back to the array, it breaks with the same error each time:

Mar 25 23:09:10 xen0 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 25 23:09:10 xen0 kernel: ata1.00: irq_stat 0x40000001
Mar 25 23:09:10 xen0 kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 25 23:09:10 xen0 kernel: res 51/04:00:38:df:f7/00:00:00:00:00/a7 Emask 0x1 (device error)
Mar 25 23:09:10 xen0 kernel: ata1.00: status: { DRDY ERR }
Mar 25 23:09:10 xen0 kernel: ata1.00: error: { ABRT }
Mar 25 23:09:10 xen0 kernel: ata1.00: configured for UDMA/133
Mar 25 23:09:10 xen0 kernel: ata1: EH complete
Mar 25 23:09:10 xen0 kernel: end_request: I/O error, dev sda, sector 18297870
Mar 25 23:09:10 xen0 kernel: md: super_written gets error=-5, uptodate=0
Mar 25 23:09:10 xen0 kernel: md: md3: recovery done.
Mar 25 23:09:10 xen0 kernel: RAID1 conf printout:
Mar 25 23:09:10 xen0 kernel: --- wd:1 rd:2
Mar 25 23:09:10 xen0 kernel: disk 0, wo:1, o:0, dev:sda3
Mar 25 23:09:10 xen0 kernel: disk 1, wo:0, o:1, dev:sdb3
Mar 25 23:09:10 xen0 kernel: RAID1 conf printout:
Mar 25 23:09:10 xen0 kernel: --- wd:1 rd:2
Mar 25 23:09:10 xen0 kernel: disk 1, wo:0, o:1, dev:sdb3

The sector is always the same: 18297870. If I check the output of smartctl -a /dev/sda it doesn't show any reallocated sectors, so iIthought, I could enforce reallocation with a method I found here.

hdparm --read-sector  18297870 /dev/sda
hdparm --write-sector  18297870 --yes-i-know-what-i-am-doing /dev/sda

But with the above commands the disk does not report any error at all - and thus does not reallocate the sector:

smartctl -a /dev/sda | grep -i reall
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

I googled for the status: { DRDY ERR } message and some say this could be a kernel bug. But then I wonder why this just started to happen now. There was no change to the system over the past years.

There's one thing which still bugs me though: smartctl -a /dev/sda also reports some errors in the logs. It's always the same error:

Error 15 occurred at disk power-on lifetime: 39971 hours (1665 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 38 df f7 a7

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 00 08  35d+12:17:45.571  FLUSH CACHE EXT
  61 80 f0 0e 96 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED
  61 80 e8 8e 95 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED
  61 80 e0 0e 95 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED
  61 80 d8 8e 94 2d 00 08  35d+12:17:44.033  WRITE FPDMA QUEUED

All this doesn't make sense to me: If the disk is really faulty, then why can i successfully read and write to the sector at which the RAID resync fails each time? And why doesn't the disk first try to reallocate the broken sector if it really detects an error?

HopelessN00b
  • 53,795
  • 33
  • 135
  • 209

2 Answers2

1

If as per your comment the error happens all the time on the same sector than it is a disk problem, for some reason the disk times out on writing there (based on the smart error log) and either it can't reallocate it or for some unknown reasons it decides not to reallocate.

In a SAS disk you could use an explicit command to reallocate the sector, I'm not aware of such a command in SATA though.

Your only option would be to try an explicit write into the sector while setting a very large timeout on the disk. You can control the disk timeout by setting /sys/block/sdX/device/eh_timeout to a large value, even as high as 600 seconds (5 minutes). This could help write the bad location and avoid aborting the request every time.

Baruch Even
  • 1,073
  • 6
  • 18
  • Thanks. Can you also maybe elaborate why you think that tools like `badblocks` fail or the reallocation of bad blocks fails? I'm not a harddisk expert at all. – Michael Härtl Aug 30 '14 at 10:14
  • The reasons why the disk will not reallocate are unknown to me too. As far as why badblocks would fail to detect it. That's a good question and I have no working theory to try and understand it. The only thing I can guess is that badblocks works by reading and your errors seem to be on write. It's rather a flimsy theory though. – Baruch Even Aug 31 '14 at 03:46
-2

You've got RAID1 so you won't have any data loss. I would replace the disk and make another resync.

Spacedust
  • 568
  • 5
  • 13
  • 28
  • This is beyond wrong. – Jacob Mar 31 '13 at 18:42
  • 1
    Devil's advocate - if you've got a disk that is behaving in odd ways like one minute throwing what look like hard errors and the next allowing write with no reallocations, why WOULDN'T you consider just replacing it? – grifferz Mar 31 '13 at 20:01
  • Thanks for the answer. I know, that i could (and should!) replace the hard disk. Actually i'm already replacing the system with a new one. But that still does not answer my question. It's not a sporadic behavior - i can reproduce the `badblocks` and `hdparm` result each and every time. It only gives me an error when i try to add the hd back to the array - each time at exactly the same sector. – Michael Härtl Apr 01 '13 at 13:47