Linux Software RAID 1 issue

Question

I recently had an HDD failure on a software RAID 1 system (Debian 6.0). The active HDD appeared to have some bad blocks, which somehow propagated to the HDD that was still OK; that disk, however, was set as a spare and couldn't synchronize. This is only my assumption, as I cannot say for sure.

Is it possible for errors from a failing HDD to propagate to the other HDD? If so, is there a setting that prevents this from happening?

Any insights on this matter would be greatly appreciated. Thank you.

score 1 · Accepted Answer · answered Sep 11 '12 at 12:31

If Linux software RAID knows it is reading corrupted data, it will not mirror it. However, if your disk is failing and silently returning incorrect data, there is no setting in RAID that can recover from that. It has no way of knowing which copy to trust when blocks differ between the two disks.
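As a rough illustration of that limitation: the md driver exposes a `sync_action` file and a `mismatch_cnt` counter under `/sys/block/<array>/md/`. Writing `check` to the former starts a read-only comparison of the mirrors, and the latter reports how much data differed. The sketch below assumes an array named `md0` and root privileges; the function names and the `sysfs_root` parameter are made up for illustration (the parameter only exists so the code can be tried against a scratch directory).

```python
from pathlib import Path


def request_check(array: str = "md0", sysfs_root: str = "/sys/block") -> None:
    """Ask md to start a read-only consistency check (needs root on a real system)."""
    (Path(sysfs_root) / array / "md" / "sync_action").write_text("check\n")


def read_mismatch_count(array: str = "md0", sysfs_root: str = "/sys/block") -> int:
    """Return the mismatch counter md recorded during the last check."""
    path = Path(sysfs_root) / array / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

A non-zero count only says that the two copies disagree somewhere. It does not say which disk holds the correct data, which is exactly the problem described above.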

However, you mention that it did identify the blocks as being 'bad'. In that case mdadm will kick that disk out of the array (mark it as faulty), and you'll have to start the array degraded manually using the correct disk. It will prevent you from getting back in sync with the faulty disk unless you force it.
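To see which member was kicked, `/proc/mdstat` is the usual place to look: a faulty member is tagged `(F)`, a spare `(S)`, and a status such as `[2/1] [_U]` means only one of two expected devices is active. Here is a minimal parsing sketch. The sample text is an invented example of the usual layout, not output from the asker's system, and `parse_mdstat` is a hypothetical helper name.

```python
import re

# Illustrative sample only; on a real system read the text from /proc/mdstat.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      976630336 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def parse_mdstat(text: str) -> dict:
    """Map each md array to its state, failed members, spares and degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\w+)\s*:\s*(\S+)(.*)$", line)
        if header:
            name, state, rest = header.groups()
            # Members look like "sda1[0]" with an optional flag such as "(F)".
            members = re.findall(r"(\w+)\[\d+\](?:\((\w)\))?", rest)
            current = {
                "state": state,
                "failed": [dev for dev, flag in members if flag == "F"],
                "spares": [dev for dev, flag in members if flag == "S"],
                "degraded": False,
            }
            arrays[name] = current
            continue
        # Status line: "[expected/active] [UU]", "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            current["degraded"] = active < expected
    return arrays
```

Calling `parse_mdstat(open("/proc/mdstat").read())` on a live machine would then show at a glance which arrays are degraded and which members are marked faulty.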

The best approach to guarding against silent data corruption is file-system-level mirroring, as offered by ZFS and btrfs. It can withstand some data corruption at the physical level because it verifies all data against checksums, so it can tell which copy is bad and read the good one instead. It may be slower in some cases, though.
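The difference can be sketched in a few lines. ZFS and btrfs store a checksum for each block, which lets them decide which mirror copy is intact. The toy model below uses SHA-256 from Python's `hashlib` purely as a stand-in (it is not meant to reflect the checksum algorithm either file system uses by default), and all function names are invented for illustration.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Stand-in block checksum (an assumption, not what ZFS or btrfs use by default)."""
    return hashlib.sha256(block).hexdigest()


def mirror_copies_agree(copies: list) -> bool:
    """All a plain mirror can do: notice that the copies differ, not which is right."""
    return all(copy == copies[0] for copy in copies)


def read_verified(copies: list, stored_checksum: str):
    """Return the first copy matching the stored checksum plus indices of bad copies."""
    good = None
    damaged = []
    for index, copy in enumerate(copies):
        if checksum(copy) == stored_checksum:
            if good is None:
                good = copy
        else:
            damaged.append(index)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    return good, damaged
```

With one corrupted copy, `mirror_copies_agree` can only report a mismatch, whereas `read_verified` returns the intact data and names the damaged copy so it could be rewritten from the good one.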

answered Sep 11 '12 at 12:31

gertvdijk


As far as I understand, _zfs_ and _btrfs_ are file systems, and I would like to keep using ext4. Are you aware of any solution that I could use on an ext4 file system? Would S.M.A.R.T. be of any help in such a case? – Alex Flo Sep 11 '12 at 12:43
@AlexFlo You should have been using SMART all along. It could have warned you of a failing disk _before_ it became a serious problem. In any case, I'd say RAID 10 and replace disks at the first sign of trouble. – Michael Hampton Sep 11 '12 at 12:49
@AlexFlo S.M.A.R.T. *could* notice that a failure is upcoming. However, it consists of diagnostics provided by the disk itself and can be misleading too. The only solution for traditional file systems is a good external incremental backup solution for point-in-time recovery. – gertvdijk Sep 11 '12 at 12:49
