Two of my four servers currently have a mismatch_cnt of about 40000, and that worries me. We are using a RAID10 setup. The manual states:

However on RAID1 and RAID10 it is possible for software issues to cause a mismatch to be reported. This does not necessarily mean that the data on the array is corrupted. It could simply be that the system does not care what is stored on that part of the array - it is unused space.

We do not use any swap files on our servers. One of the server's HDDs fails the SMART self-check, and its Available_Reservd_Space is too low. The hosting provider says that it replaces HDDs only when they are physically faulty.

I think I do not understand the real meaning and usefulness of this parameter. What other reasons could there be for it to have such a large value? How can the system not care about what is stored on that part of the array if the array is mirrored? For security reasons, I would expect the system to sync free space as well, and then what is left?

Are there any reliable ways to estimate the risk of having a particular HDD in a server?

Vladislav Rastrusny

1 Answer

Often, two reasons are given for high mismatch_cnt on a RAID1/10 array:

  • swap on the array
  • very fast file creation/writing/rewriting/deletion workloads

The above reasons are harmless: while they do point to differences in the array (basically, a de-synchronized array), they concern unused disk space.

However, there is a much more concerning and dangerous cause of a high mismatch_cnt: a hardware issue (e.g. a faulty power supply delivering inconsistent power and/or a misbehaving disk DRAM chip) can alter in-flight data, leading to many inconsistencies between the two disks.

You can find more information in this thread on the linux-raid mailing list.

shodanshok
  • What if `echo 'repair' >/sys/block/md2/md/sync_action` does not make this number zero (it is still like 40k pieces and doesn't go down), but mdadm does not bring the drive offline? – Vladislav Rastrusny Nov 28 '17 at 20:01
  • `repair` repairs mismatches and reports the number of mismatches it found. In other words, it is perfectly normal for `repair` to show the same mismatch count given by a `check`. To get an accurate post-repair `mismatch_cnt` value, you *must* execute a `check` just after the `repair` action. If a `repair` is immediately followed by a `check`, it should give 0 mismatches. – shodanshok Nov 28 '17 at 20:13
  • I see. It should be zero if you have no background process that, for example, writes to a memory-mapped file while the RAID is being checked or repaired, right? Because if you do, you can again get a small number of mismatches due to asynchronous disk writes. As long as it stays small in this case, everything should be OK. If it is still large, can we assume that it is caused by disk media corruption, for instance? – Vladislav Rastrusny Nov 28 '17 at 20:54
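
The repair-then-check sequence described in the comments could be scripted roughly as follows. This is only a sketch, under these assumptions: the array's sysfs directory (for example `/sys/block/md2/md`, as in the comment above) is passed as an argument; the script runs as root; and `sync_action` reads `idle` once the action has finished. The function names and the `POLL_SECONDS` variable are made up for illustration and are not part of mdadm.

```shell
#!/bin/sh
# Sketch only: helper names and POLL_SECONDS are illustrative, not part of mdadm.

# Block until the array reports that no check/repair/resync is running.
wait_idle() {
    md_dir=$1
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "${POLL_SECONDS:-30}"
    done
}

# Start an action ("check" or "repair"), wait for it to finish,
# then print the mismatch_cnt that the action left behind.
run_action() {
    md_dir=$1
    action=$2
    echo "$action" > "$md_dir/sync_action"
    wait_idle "$md_dir"
    cat "$md_dir/mismatch_cnt"
}

# Run repair, then check. Only the count from the check
# describes the state of the array after the repair.
repair_then_check() {
    md_dir=$1
    echo "mismatches found by repair: $(run_action "$md_dir" repair)"
    echo "mismatches after repair:    $(run_action "$md_dir" check)"
}

# Example (as root): repair_then_check /sys/block/md2/md
```

If the second number is still large on an otherwise quiet system, the harmless explanations become less plausible and the hardware causes mentioned in the answer deserve a closer look.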