
I'm running a new CentOS 7 machine. The OS lives on a 2x SSD setup, and I also have 4x SAS drives set up in software RAID10. The RAID10 array is large: 4x 12TB drives, so 24TB usable.

File system is: ext4

I've now finished copying some files to it, and I'm running a RAID check (the very first one).

Every 2.0s: cat /proc/mdstat    Mon Oct 14 06:28:38 2019

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md127 : active raid10 sdf1[3] sdd1[1] sde1[2] sdc1[0]
      23437503488 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [======>..............]  check = 32.6% (7649123136/23437503488) finish=3402.6min speed=77333K/sec
      bitmap: 0/175 pages [0KB], 65536KB chunk

md2 : active raid1 sdb2[1] sda2[0]
      20478912 blocks [2/2] [UU]

md3 : active raid1 sdb3[1] sda3[0]
      447318976 blocks [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

unused devices: <none>

It started around 250,000K/sec, but it keeps getting slower, and now it's around 75,000K/s.

The drives in the RAID10 array are not being used by anything else at the moment.

I already tweaked the speed limit settings.

dev.raid.speed_limit_min = 100000
dev.raid.speed_limit_max = 1000000
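
(For reference, a sysctl invocation like the following sets these at runtime; values are in KiB/s, and the sysctl.d file name is just an example, not something from my setup.)

```
# Raise the md resync/check speed floor and ceiling (KiB/s)
sysctl -w dev.raid.speed_limit_min=100000
sysctl -w dev.raid.speed_limit_max=1000000

# Persist across reboots (illustrative file name)
cat > /etc/sysctl.d/90-raid-speed.conf <<'EOF'
dev.raid.speed_limit_min = 100000
dev.raid.speed_limit_max = 1000000
EOF
```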

CPU usage is at about 2%, there is plenty of free RAM, and the 4 drives in the RAID array are reporting about 25% utilization each, so they are not being pushed hard by the resync.
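
(The utilization figure comes from per-device I/O statistics; for reference, one way to watch it while the check runs is extended iostat output, e.g.:)

```
# Extended per-device stats every second; watch the %util column for sdc-sdf
iostat -x -k 1
```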

My questions:

  1. What can I do to speed this up?

  2. And what could be causing it to slow down?

Mr.Boon
  • Been going for 24h now since I posted this. It's done 50%, but speed has slowed down to about 25,000K/s, down from 250,000K/s when it first started. CPU, RAM & IO load is still low. – Mr.Boon Oct 15 '19 at 05:02
  • 1
    Can you post the output of `/sys/block/md127/md/sync_speed`, `sync_speed_min` and `sync_speed_max`? Please also show 60 seconds of `iostat -x -k 1` (using an external pasting service as [pastebin](https://pastebin.com/)) – shodanshok Oct 15 '19 at 07:46
  • Hi, thanks for your feedback. I've posted the output of all those commands here: https://pastebin.com/MncwCP01 Any help would be appreciated. Raid check is currently on 53% – Mr.Boon Oct 15 '19 at 12:04
  • 1
    Try setting `sync_speed_min` to `50000` and `sync_speed_max` to `200000`. Does it change anything? If not, please post the output of `smartctl --all ` and `dmesg`. – shodanshok Oct 15 '19 at 14:59
  • Changing those settings didn't help unfortunately. Smartctl output of the 4 raid drives https://pastebin.com/jwYGULYZ Thank you for your help. dmesg output seems a bit large to paste. – Mr.Boon Oct 15 '19 at 16:39
  • 1
    It seems the array has some very brief burst of activity, then it does basically nothing for extended periods of time. Can you post the output of `cat /proc/sys/dev/raid/speed_limit_m*`? Do you see any blocked task in `dmesg` or syslog? – shodanshok Oct 15 '19 at 22:25
  • The speed limit settings are correct. Both min and max are set high enough. I've outputted dmesg here: https://pastebin.com/Mb1ZxhxK and the output of "messages" here: https://pastebin.com/eVDrKE9G There are some errors happening it seems, poweron-off stuff also. Any idea what might be happening? – Mr.Boon Oct 16 '19 at 07:19
  • 1
    your `message` file show exactly what I expected: a disk/enclosure continuously aborting commands and resetting. The affected disk seems always to be `sdc`. I'll write an answer to describe what you can/should do. – shodanshok Oct 16 '19 at 08:34

1 Answer


Your messages file shows exactly what I expected: a disk/enclosure continuously aborting commands and resetting. The affected disk always seems to be sdc, so it is probably the culprit.

The obvious action to solve the problem is to replace it. However, I would first try to:

  • reseat the drive and its power/data cables;
  • swap sdc with another disk (to change SAS cable/power cord) and check whether the errors follow the drive or stay bound to the same slot/port;
  • optionally, read directly from the disk via dd if=/dev/sdc of=/dev/null bs=1M iflag=direct to gather additional debug data (see the sketch after this list).
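
A minimal sketch of that direct-read test, assuming the device names from the question, with the kernel log followed in a second terminal so that aborts/resets can be matched to the read:

```
# Terminal 1: sequential direct read of the suspect disk
dd if=/dev/sdc of=/dev/null bs=1M iflag=direct

# Terminal 2: watch for resets/aborts while the read runs
dmesg -w | grep -iE 'sdc|reset|abort'
```

If the same errors later appear against whatever disk sits in that slot, suspect the cable, backplane or controller port rather than the drive itself.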

If you can't, for some reason, replace the drive, you can try forcing bad-block reallocation by completely rewriting the device via dd if=/dev/zero of=/dev/sdc bs=1M oflag=direct. BIG WARNING: this will completely and irreversibly destroy all data on sdc. Try it only if you really can't replace the drive.
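
If you do go that route, here is a minimal sketch of the surrounding steps, assuming the array is md127 and the member partition is sdc1 as in the question (the mdadm steps are my assumption about the setup, not commands given above):

```
mdadm /dev/md127 --fail /dev/sdc1      # mark the suspect member as failed
mdadm /dev/md127 --remove /dev/sdc1    # remove it from the array
dd if=/dev/zero of=/dev/sdc bs=1M oflag=direct   # full rewrite: destroys ALL data on sdc
smartctl -A /dev/sdc                   # check reallocated/pending sector counts afterwards
# Recreate the partition layout on sdc (not shown), then re-add and let the array rebuild:
mdadm /dev/md127 --add /dev/sdc1
```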

shodanshok