
I've got a RAID-5 mdadm array that reliably causes "Buffer I/O error on dev md0, logical block 1598030208, async page read" to be written to dmesg when reading that block. The read itself also fails, of course. This behavior is consistent across reboots, and it is always the same block.
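
For reference, the failing block can be read in isolation with something along these lines; the 4 KiB block size is an assumption about the units of the "logical block" number in the dmesg message, so it may need adjusting:

    # direct, uncached read of the reported block (4096-byte blocks assumed)
    dd if=/dev/md0 of=/dev/null bs=4096 skip=1598030208 count=1 iflag=direct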

I would like to understand where the error comes from. As I understand it, either one of the physical drives is causing trouble, or the array is in an inconsistent state. Either way, I would like to know which it is, so that I can take further steps to fix the issue.

Unfortunately, dmesg gives no further hints. I have looked at the SMART parameters of all drives involved, but none raises suspicion. What else can I try to troubleshoot the array?
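
For completeness, md exposes per-member state and a counter of corrected read errors in sysfs; a quick sketch to dump them (array name md0 assumed):

    # per-member state and corrected-read-error count, as exposed by the md driver
    for d in /sys/block/md0/md/dev-*; do
        printf '%s: state=%s errors=%s\n' "${d##*/}" "$(cat "$d/state")" "$(cat "$d/errors")"
    done

A non-zero errors value there would at least single out one member.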

Thanks in advance!

Edit: As requested, the output of mdadm --detail /dev/md0:

/dev/md0:
        Version : 1.0
  Creation Time : Sat Dec 28 03:50:47 2013
     Raid Level : raid5
     Array Size : 15621798144 (14898.11 GiB 15996.72 GB)
  Used Dev Size : 3905449536 (3724.53 GiB 3999.18 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Fri Dec 22 11:36:24 2017
          State : clean 
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : 0
           UUID : 01a3d3c1:6a5ac63d:0cc10dd0:f8e7a1c4
         Events : 2132931

    Number   Major   Minor   RaidDevice State
       5       8       51        0      active sync   /dev/sdd3
       1       8       83        1      active sync   /dev/sdf3
       4       8       35        2      active sync   /dev/sdc3
       7       8       67        3      active sync   /dev/sde3
       6       8        3        4      active sync   /dev/sda3

Update: I tried scrubbing the array by writing repair to md/sync_action. The process completed without any output to dmesg or signs of trouble in /proc/mdstat. However, reading from the array still fails at the same block as above, 1598030208.
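
For reference, the scrub was triggered via sysfs roughly as follows (md0 assumed; mismatch_cnt is the counter that a check/repair pass updates):

    # start a repair pass; "check" would only count mismatches without rewriting
    echo repair > /sys/block/md0/md/sync_action
    # progress is visible here while the pass runs
    cat /proc/mdstat
    # sectors found inconsistent by the last check/repair pass
    cat /sys/block/md0/md/mismatch_cnt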

Update 2, for reference: I asked this question on the linux-raid mailing list: https://marc.info/?l=linux-raid&m=151486117529497&w=2

RQM
  • What's the output of `mdadm --detail /dev/md0`? – Deeh Dec 26 '17 at 19:48
  • @Deeh: Thanks for asking, I've edited my post to show the output. – RQM Dec 26 '17 at 20:12
  • Need more version numbers for your md package, kernel, distribution, and a general idea of what the workload is like. – Spooler Dec 26 '17 at 20:16
  • @SmallLoanOf1M: It's Debian 9 stretch, x86_64, kernel 4.9.30-2+deb9u5, mdadm v3.4, all software as distributed via the Debian repositories. The machine is used as a backup server, i.e. writes of a few TB at most, once a day, from one source only. This is also where I first encountered errors resulting in a read-only remount, which prompted me to run `fsck` on `md0` (unmounted), which consistently produces the error I described in my post. – RQM Dec 26 '17 at 20:33
  • Check your drives with `smartctl -a /dev/sda`. Add output of `fdisk -l /dev/sda`. Did anything change recently (drive replacement/repartitioning)? – Deeh Dec 26 '17 at 20:45
  • @Deeh: Like I said, `smartctl` shows no suspicious signs (there have been a few errors in the past on one disk, but none that coincide with the errors I see now). The disk partition layouts match. Regarding changes: these disks used to be part of a QNAP NAS that failed recently (SATA controller or backplane). I migrated the disks, and am now able to mount the `ext4` filesystem normally. I'm just curious whether I can make `mdadm` tell me why it can't read that one block. Surely the RAID subsystem must know why it can't produce the data? – RQM Dec 26 '17 at 21:54
  • Run `smartctl -t long /dev/sd*` on all your drives. One (or possibly more) will fail with an error. Then you will know which drive is failing and should be replaced. – Michael Hampton Jan 27 '18 at 23:26
  • @MichaelHampton: Thanks, but that's not it in this case. I ran three passes of badblocks on each disk, followed by a SMART self-test. Not a single error was produced on any of the drives, by either badblocks or SMART. – RQM Jan 29 '18 at 13:07

0 Answers