
I have an md-based RAID5 array which has been working without issue for about two years. Yesterday I had spontaneous disk and/or PHY resets on one disk (but no actual read errors). md marked the disk as faulty, leaving the array in the 'clean, degraded' state, so I tried removing and re-adding it. md started resyncing the array at a good speed (140 MB/s), but at about 0.6% the resync speed began falling, and within about 10 seconds md quit with the message "md: md0: recovery interrupted", without any SCSI or other errors visible in dmesg output (my current SCSI logging level is 0x10012DD). This happened on several attempts.

smartctl -a, smartctl -t short and scanning the first 1% of all disks with badblocks turned up no errors. A read-only xfs_repair -n on the degraded array showed a bunch of I/O errors and bad checksums, as expected, but after all these exercises the resync got past the point where it had been quitting earlier.

I am now running badblocks on the rest of the disks and hoping the array will eventually finish resyncing, so I can add fresh disks and finally go up to RAID6, but naturally there is no guarantee this will happen, which leads to the question:
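
For reference, the remove/re-add step above was the usual mdadm cycle, roughly the following (a sketch with placeholder device names; /dev/md0 is the array, /dev/sdX1 the failed member):

    mdadm /dev/md0 --remove /dev/sdX1   # drop the member md marked faulty
    mdadm /dev/md0 --re-add /dev/sdX1   # put it back (use --add if --re-add is refused)
    cat /proc/mdstat                    # follow the resync progress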

Is it possible to make md resync plow past errors and trash bad blocks? I would much rather end up with a 0.01% corrupted array than with nothing. The data in the array is not critical and I can re-check it for errors at higher levels, but recovering it from scratch would take a very long time.

Anton Tykhyy

2 Answers


Looking at the driver code in raid5.c, it does not appear possible to force md to ignore errors during a resync. However, if nothing else helps, as a last resort the array can be reassembled without a mandatory resync by re-creating it with --assume-clean; see e.g. the RAID Wiki and this answer.
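
A minimal sketch of that last-resort re-creation, assuming a 4-disk array at /dev/md0 (all device names, the level, chunk size, metadata version and device order here are placeholders and must exactly match the original array; read them out with mdadm --examine first, because a wrong --create overwrites the superblocks and destroys the array):

    # record the existing superblocks before touching anything
    mdadm --examine /dev/sd[abcd]1 > superblocks.txt

    # stop the array, then re-create it in place without triggering a resync
    mdadm --stop /dev/md0
    mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 \
        --chunk=512 --metadata=1.2 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

    # verify parity read-only before trusting the result
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt    # non-zero means inconsistent stripes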

Anton Tykhyy

Read-only xfs_repair -n on the degraded array showed a bunch of I/O errors and bad checksums, as expected

It is not expected: one faulty / missing disk should cause no data corruption in an otherwise good RAID5 array. You probably have multiple unreadable data sectors on one or more of the other disks. While recent mdadm versions can be forced to continue recovery, the internal bad block list is quite small, and reconstruction aborts when it is full.
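
To see whether md has already filled a member's bad block log, recent mdadm can dump it per device (a sketch; /dev/sdX1 is a placeholder member):

    # dump the per-device bad block log recorded in the superblock
    mdadm --examine-badblocks /dev/sdX1

    # the same list should be visible through sysfs while the array is assembled
    cat /sys/block/md0/md/dev-sdX1/bad_blocks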

I suggest you double-check the health of all your disks.
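
A fuller check than the short self-test already tried could look like this (a read-only sketch; /dev/sdX is a placeholder, and a full badblocks pass over a large disk takes many hours):

    smartctl -a /dev/sdX            # overall SMART attributes and error log
    smartctl -t long /dev/sdX       # extended self-test, including a surface scan
    smartctl -l selftest /dev/sdX   # check the result once the test finishes
    badblocks -sv /dev/sdX          # read-only scan of the whole disk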

shodanshok
  • That was almost the first thing I did. The disks are all healthy: SMART is clean and the extended self-tests passed. The problem seems to have been an insufficient power supply, which was causing some disks to reset under load, hence the I/O errors. – Anton Tykhyy Oct 15 '20 at 12:56