I am seeing wildly different software RAID10 performance and behavior on two otherwise identical machines.
I have two machines which are hardware identical, bought at the same time, with the same software versions, hardware versions, and firmware versions. Each has a SAS controller with 8 x 6 Gb/s channels going to a SAS enclosure which holds 12 SAS disks.
On machine 1, which is stable and seems to be working perfectly, each disk in the raid array behaves more or less identically: busy time is equal (about 33% across all disks at production load levels), and while the weekly software raid check runs, write and read performance is not degraded. The full raid check completes in about a day, using all available spare bandwidth to finish as fast as possible. This amounts to about 200 MB/sec of reads while the check runs.
Machine 2 is a problem child. The full raid check essentially never completes, although it is also configured to use all available disk bandwidth. While it is attempting to check, it plods along at 5 MB/sec, and write performance drops to about 30 MB/sec during this time. Also, four disks are at 35% busy, while the remaining ones are 22% busy on average.
After cancelling the raid check on machine 2, the write speed returns to about 160 MB/sec.
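For comparing how the check is configured on both machines, a minimal sketch that dumps the md check tunables and status could look like the following. The array name `md0` is an assumption (substitute the real device), and the function name and its `sys_root`/`proc_root` parameters are hypothetical helpers for illustration. The files it reads are the standard md sysfs and procfs entries; the kernel reports these speeds in KB/s.

```python
from pathlib import Path


def read_md_sync_state(md_name="md0", sys_root="/sys", proc_root="/proc"):
    """Collect the md check/resync tunables and status for one array.

    md_name is an assumption; use the name shown in /proc/mdstat.
    Missing or unreadable files are reported as None.
    """
    md_dir = Path(sys_root) / "block" / md_name / "md"
    raid_dir = Path(proc_root) / "sys" / "dev" / "raid"
    files = {
        "sync_action": md_dir / "sync_action",        # idle, check, resync, ...
        "sync_speed": md_dir / "sync_speed",          # current speed
        "sync_speed_min": md_dir / "sync_speed_min",  # per-array lower limit
        "sync_speed_max": md_dir / "sync_speed_max",  # per-array upper limit
        "sync_completed": md_dir / "sync_completed",  # progress in sectors
        "speed_limit_min": raid_dir / "speed_limit_min",  # system-wide limit
        "speed_limit_max": raid_dir / "speed_limit_max",  # system-wide limit
    }
    state = {}
    for key, path in files.items():
        try:
            state[key] = path.read_text().strip()
        except OSError:
            state[key] = None
    return state


if __name__ == "__main__":
    for key, value in read_md_sync_state().items():
        print(f"{key}: {value}")
```

Running this on both machines while a check is active and diffing the output would show whether the limits really are identical, or whether one array is being throttled to its minimum speed.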
If I use dd to test each individual mpath device, on machine 1 I get most speeds around 145 MB/sec reading per drive, and the lowest of 119 MB/sec followed by 127 MB/sec. The rest are all in the 145 MB/sec range.
On machine 2, three disks read at 107 MB/sec and the rest are all above 135 MB/sec, with a peak of 191 MB/sec (!) for one disk.
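To make the per-device read test repeatable on both machines, a rough Python equivalent of a sequential dd read could be used. This is only a sketch: the `/dev/mapper/mpath*` glob is an assumption about where the multipath devices live, the function name is hypothetical, and the reads go through the page cache, so a repeated run over the same region may report inflated numbers.

```python
import glob
import time


def read_throughput(path, total_bytes=1024 * 1024 * 1024, block_size=1024 * 1024):
    """Read up to total_bytes sequentially from path.

    Returns (bytes_read, MB_per_second), with MB meaning 10**6 bytes.
    Stops early if the file or device is shorter than total_bytes.
    """
    bytes_read = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as dev:
        while bytes_read < total_bytes:
            chunk = dev.read(min(block_size, total_bytes - bytes_read))
            if not chunk:
                break  # end of device or file
            bytes_read += len(chunk)
    elapsed = time.monotonic() - start
    speed = bytes_read / 1e6 / elapsed if elapsed > 0 else float("inf")
    return bytes_read, speed


if __name__ == "__main__":
    # Assumed device location; needs read permission on the devices.
    for dev in sorted(glob.glob("/dev/mapper/mpath*")):
        n, speed = read_throughput(dev)
        print(f"{dev}: {n} bytes at {speed:.1f} MB/sec")
```

Reading the same amount from the same offset on every disk keeps the numbers comparable between the two machines.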
I admit to being well out of my comfort zone here, but I cannot find any evidence to draw a conclusion from. I have also checked SMART stats on each disk on both machines, and while there are a fair number of "read corrected" errors on all disks, there seems to be no correlation between those values and the read performance, nor between those values and the busy% difference.
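For reference, per-disk busy% can be derived from two snapshots of /proc/diskstats, assuming the figures quoted above correspond to what iostat reports as %util. The sketch below uses the "time spent doing I/Os" counter (in milliseconds), which is the tenth statistics field after the device name. All function names are hypothetical.

```python
import time


def parse_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O.

    Each /proc/diskstats line is: major minor name, then the stats fields;
    the I/O time counter is the tenth stats field (index 12 after split()).
    """
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            ticks[fields[2]] = int(fields[12])
    return ticks


def busy_percent(before, after, interval_s):
    """Percentage of the interval each device spent with I/O in flight."""
    interval_ms = interval_s * 1000.0
    return {
        name: 100.0 * (after[name] - before[name]) / interval_ms
        for name in after
        if name in before
    }


def sample_busy_percent(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return busy% per device."""
    with open(path) as f:
        before = parse_io_ticks(f.read())
    start = time.monotonic()
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_io_ticks(f.read())
    return busy_percent(before, after, time.monotonic() - start)
```

Calling `sample_busy_percent()` on both machines under similar load, and again while the check runs, would show which member disks make up the 35% group and which the 22% group.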
Nothing I can find explains the poor performance when performing a RAID check of the array on one box vs on the other. Suggestions on where to go next to debug this would be appreciated.