Sorry for the late arrival; I am surprised nobody answered this. There is even a link to a similar problem, but I doubt cables are in play in this case.
You started a sync to a new disk, but when the sync got to 30%, the source (the last drive left that has all the data) hit a read error. On a read error the Linux MD RAID driver redirects the read to other component devices, but in this case there is no synced component device to read from, so it gives up. It stops the sync on the first such unrecoverable error and then restarts the sync from the beginning. Of course, pulling the spare out and re-adding it won't help. You have to use other ways to complete the sync, or otherwise retrieve the (slightly corrupted) data in such a case.
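You can usually see this pattern from the outside: the recovery percentage keeps falling back while the kernel log shows the read error on the source disk. Something along these lines should show it (the grep pattern is just a rough suggestion):
cat /proc/mdstat                       # recovery progress keeps restarting instead of reaching 100%
dmesg | grep -iE 'md|ata|i/o error'    # look for the unrecoverable read error on the source disk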
The system might work perfectly, because this sector may not contain any data, so it was never read during normal operation; but a RAID sync is a special case where everything gets read. We call such cases silent bad blocks.
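If you want to confirm this before touching anything, the drive's SMART counters usually already know about such sectors. A quick check could look like this (assuming smartmontools is installed; /dev/sdX is just a placeholder for the failing source disk):
smartctl -A /dev/sdX | grep -i -e pending -e reallocated   # non-zero Current_Pending_Sector means unreadable sectors waiting to be rewritten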
The first idea is to force the drive to remap the bad block internally. Unfortunately, it is impossible to do this with a guarantee, but there is a high chance that if you write to this particular sector, it will get remapped and then read back successfully. To do that, one can use the hdparm utility (notice that --repair-sector is an alias for --write-sector):
hdparm --write-sector 448271680 /dev/sdX
I deliberately put an almost random number here, and /dev/sdX is likewise just a placeholder for your failing disk. The number is 896543360/2, where the big number was taken from the dmesg error message. You have to calculate it yourself for your case. Be extremely careful: hdparm will refuse to overwrite anything until you also pass its --yes-i-know-what-i-am-doing flag, and that is for a reason. I suggest doing a read check (--read-sector) with the same number first, to trigger the same error message and therefore prove this is indeed the right sector. Note that you will lose whatever is in this sector, but it is unreadable anyway, so it is already essentially lost, and if it is a silent bad block, there was no useful information in it.
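For example, with the (made-up) sector number from above and /dev/sdX again standing in for your disk, the read check would look like this; it should fail with an I/O error and produce the familiar dmesg line if this really is the bad sector:
hdparm --read-sector 448271680 /dev/sdX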
Repeat this for all unreadable blocks. You'll still need to replace this drive too, once the sync is complete.
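If there are several of them, a small shell loop saves some typing. This is only a sketch: the sector numbers below are made up and /dev/sdX is a placeholder, so substitute your own values:
for s in 448271680 448272110; do
    hdparm --write-sector "$s" --yes-i-know-what-i-am-doing /dev/sdX
done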
The other way to help the situation requires stopping the service for an extended period of time. You need to stop the faulty RAID and run ddrescue from the faulty disk to a new disk. After that, you first need to remove the old device completely and start the system from the new disk (with degraded arrays, I know). Then, if it works, add another new disk and complete the sync.
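A minimal sketch of that path, with all names being placeholders (/dev/md0 the degraded array, /dev/sdX the failing disk, /dev/sdY the new one, and the map file path is up to you):
mdadm --stop /dev/md0
ddrescue -f /dev/sdX /dev/sdY /root/rescue.map   # -f is required to overwrite a block device; the map file records progress
The map file is the important part: ddrescue first copies everything it can read, then retries the bad areas, and you can stop and restart it without losing progress.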
In case you were wondering, I happen to have done successful repairs both ways.
The lesson here is: just having RAID is not enough; for data to be safe you need to monitor your array health, scrub it periodically (that is, perform a read check of all devices and compare, to be sure every block gets read) and, of course, take the required actions in a timely manner. Hardware RAIDs also have the ability to set up automatic periodic scrubbing. For each MD RAID, you should do this once a month:
echo check > /sys/block/md0/md/sync_action
(Debian has this set up by default, AFAIK.) So when some disk develops a silent unreadable sector, you'll find out about it within a month. Then don't forget to replace the dying disk as soon as possible!
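If your distribution doesn't set this up for you, a simple cron entry is enough. A minimal sketch (the schedule, array name and file name are up to you; Debian, IIRC, ships its own version as /etc/cron.d/mdadm, which calls its checkarray script):
# /etc/cron.d/mdraid-check -- hypothetical file name; check md0 at 01:00 on the 1st of each month
0 1 1 * * root echo check > /sys/block/md0/md/sync_action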