I have an md-based RAID5 array that has been working without issue for about two years. Yesterday I had spontaneous disk and/or PHY resets on one disk (but no actual read errors). md marked the disk as faulty, leaving the rest of the array in the 'clean, degraded' state, so I tried removing and re-adding it. md started resyncing the array at a good speed (140 MB/s), but at about 0.6% the resync speed began falling, and within about 10 seconds the recovery aborted with the kernel message "md: md0: recovery interrupted", without any SCSI or other errors visible in dmesg output (my current SCSI logging level is 0x10012DD). This happened on several attempts.
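For the record, the remove/re-add sequence went roughly like this (a sketch, with `/dev/sdX` standing in for the failed disk):

```
# Mark the flaky disk failed and pull it from the array
mdadm /dev/md0 --fail /dev/sdX
mdadm /dev/md0 --remove /dev/sdX

# Re-add it; md then starts the recovery/resync
mdadm /dev/md0 --re-add /dev/sdX

# Watch the rebuild progress and speed, and follow the kernel log
cat /proc/mdstat
dmesg -w
```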
`smartctl -a`, `smartctl -t short`, and scanning the first 1% of all disks with `badblocks` didn't turn up any errors.
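In case it matters, those checks were along these lines (a sketch; `/dev/sdX` stands in for each member disk, and the `badblocks` invocation is just one way of approximating "the first 1%"):

```
# SMART attributes and a short self-test on each member disk
smartctl -a /dev/sdX
smartctl -t short /dev/sdX

# Read-only badblocks scan of roughly the first 1% of the disk.
# badblocks takes an optional last-block argument; with -b 4096 the
# block count is the 512-byte sector count (blockdev --getsz) divided by 8.
SECTORS=$(blockdev --getsz /dev/sdX)
badblocks -sv -b 4096 /dev/sdX $(( SECTORS / 8 / 100 ))
```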
A read-only `xfs_repair -n` on the degraded array showed a bunch of I/O errors and bad checksums, as expected, but after all these exercises the resync got past the point where it had been quitting earlier. I am now running `badblocks` on the rest of the disks and hoping the array will eventually finish resyncing so I can add fresh disks and finally go up to RAID6, but naturally there is no guarantee this will happen, which leads to the question:
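While that runs, this is roughly how I'm keeping an eye on the recovery (the array is `/dev/md0`, as in the kernel message above):

```
# Overall array state and per-disk status
mdadm --detail /dev/md0

# What md is currently doing and how far it has got
cat /sys/block/md0/md/sync_action
cat /sys/block/md0/md/sync_completed
```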
Is it possible to make md resync plow past errors and trash the bad blocks? I would much rather end up with a 0.01%-corrupted array than with nothing. The data in the array is not critical and I can re-check it for errors at higher levels, but recovering it from scratch would take a very long time.