I have a raid5 array that has a check run on it once a month. It is configured so that the check runs for 6 hours from 01:00 and then stops. The following nights it will resume the check for another 6 hours until it has completed.
The issue I have is that sometimes, when mdcheck attempts to stop the running check, it hangs. Once this happens, reads from the array still work, but any process that attempts to write to it hangs.
The array state is as follows:
md0 : active raid5 sdb1[4] sdc1[2] sdd1[5] sde1[1]
8790398976 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
[========>............] check = 44.2% (1296999956/2930132992) finish=216065.8min speed=125K/sec
bitmap: 0/6 pages [0KB], 262144KB chunk
The check = 44.2% (1296999956/2930132992) figure never advances or stops.
From looking at the /usr/share/mdadm/mdcheck script, it appears that every 2 minutes, until the end time, it reads /sys/block/md0/md/sync_completed and saves the position in a file stored in the /var/lib/mdcheck/ directory. Looking in that directory, the file is there, dated 2 minutes before the check was due to stop, with the value 2588437040. The current value of sync_completed is 2593999912, which indicates that everything was still working 2 minutes before it was due to stop.
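The two sets of numbers above can be cross-checked with a little arithmetic. The sketch below assumes that sync_completed reports 512-byte sectors in the form "done / total", that /proc/mdstat reports 1 KiB blocks, and that the total is the Used Dev Size in sectors shown by mdadm --examine (5860265984). The function names sync_progress and checkpoint_gap are made up for illustration and are not part of mdcheck.

```shell
# Convert a sync_completed-style string ("done / total", in 512-byte sectors)
# into KiB done and a percentage truncated to one decimal place.
sync_progress() {
    done_sectors=${1%% *}     # text before the first space
    total_sectors=${1##* }    # text after the last space
    permille=$(( done_sectors * 1000 / total_sectors ))
    printf '%d KiB done, %d.%d%%\n' \
        $(( done_sectors / 2 )) $(( permille / 10 )) $(( permille % 10 ))
}

# Distance in sectors and KiB between a saved checkpoint and the current position.
checkpoint_gap() {
    gap=$(( $2 - $1 ))
    printf '%d sectors (%d KiB)\n' "$gap" $(( gap / 2 ))
}
```

With the values quoted above, sync_progress '2593999912 / 5860265984' gives 1296999956 KiB and 44.2%, which matches the /proc/mdstat line, and checkpoint_gap 2588437040 2593999912 gives 5562872 sectors, so the check moved roughly 2.6 GiB past the last saved checkpoint before it stalled.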
Running lsof on the mdcheck process reveals the following:
mdcheck 23887 root 1w REG 0,21 4096 43388 /sys/devices/virtual/block/md0/md/sync_action
This appears to show that the mdcheck process is hanging when trying to stop the check after 6 hours. I confirmed this by running the following in a terminal:
sudo echo idle >/sys/devices/virtual/block/md0/md/sync_action
and this also hung.
The only way I have found to stop the check is to attempt a reboot, which also hangs, and then cycle the power.
How do I stop or unhang mdcheck (and hence the array) without a reboot, and how do I find the cause of the issue and resolve it?
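One possible starting point for finding the cause is to see which tasks are stuck in uninterruptible sleep (state D) and where in the kernel they are waiting. This is only a sketch of a diagnostic step, not a fix. The helper name blocked_tasks is made up for illustration, PID 23887 is the mdcheck process from the lsof output above, and /proc/PID/stack is only readable by root and only present if the kernel was built with stack trace support.

```shell
# Filter ps-style output on stdin: keep the header line plus every row whose
# STAT column (second field) starts with "D" (uninterruptible sleep).
blocked_tasks() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Intended use on the affected machine (not run here):
#   ps -eo pid,stat,wchan:32,comm | blocked_tasks
#   sudo cat /proc/23887/stack                    # kernel stack of the hung mdcheck
#   echo w | sudo tee /proc/sysrq-trigger         # log all blocked tasks to dmesg
```

The wchan column and the kernel stacks would show which md or block-layer function each hung task is sleeping in, which is the kind of detail a kernel bug report normally needs.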
Additional information:
OS: openSUSE Leap 15.2
Kernel: 5.3.18-lp152.57-default
Running the consistency check without interruption succeeds.
Running extended self tests on the disks succeeds.
Replacing all the SATA cables has no effect.
Relevant dmesg entries:
[ 5.565328] md/raid:md0: device sdb1 operational as raid disk 3
[ 5.565330] md/raid:md0: device sdc1 operational as raid disk 2
[ 5.565331] md/raid:md0: device sdd1 operational as raid disk 0
[ 5.565332] md/raid:md0: device sde1 operational as raid disk 1
[ 5.575520] md/raid:md0: raid level 5 active with 4 out of 4 devices, algorithm 2
[ 5.640309] md0: detected capacity change from 0 to 9001368551424
[53004.024693] md: data-check of RAID array md0
[74605.665890] md: md0: data-check interrupted.
[139404.408605] md: data-check of RAID array md0
[146718.260616] md: md0: data-check done.
[1867115.595820] md: data-check of RAID array md0
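The dmesg timestamps are seconds since boot, so the length of each check run can be read off by subtraction. The sketch below does that for dmesg-style input on stdin. The function name check_durations is made up for illustration, and it assumes the message wording matches the entries above.

```shell
# Print how long each data-check ran, in whole seconds, from dmesg-style
# lines on stdin. A check with no end message is reported as still running.
check_durations() {
    awk '
        {
            # Extract the timestamp from the leading "[ seconds.micros]" field.
            ts = $0
            sub(/^\[ */, "", ts)
            sub(/\].*/, "", ts)
        }
        /md: data-check of RAID array/ { start = ts; running = 1; next }
        running && /data-check (interrupted|done)/ {
            what = ($0 ~ /interrupted/) ? "interrupted" : "done"
            printf "%s after %d s\n", what, ts - start
            running = 0
        }
        END { if (running) print "still running (no end message)" }
    '
}
```

Fed the entries above (for example with dmesg | check_durations), the first run comes out as interrupted after 21601 s, which is the configured 6 hours, the second as done after 7313 s, and the third as still running because neither an "interrupted" nor a "done" message was ever logged for it.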
Output of mdadm --detail /dev/md0:
Version : 1.2
Creation Time : Sat Nov 7 09:48:15 2020
Raid Level : raid5
Array Size : 8790398976 (8.19 TiB 9.00 TB)
Used Dev Size : 2930132992 (2.73 TiB 3.00 TB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Feb 2 06:59:55 2021
State : active, checking
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Check Status : 44% complete
Name : neptune:0 (local to host neptune)
UUID : 5dd490df:79bf70fa:b4b530bc:47b30419
Events : 28109
Number Major Minor RaidDevice State
5 8 49 0 active sync /dev/sdd1
1 8 65 1 active sync /dev/sde1
2 8 33 2 active sync /dev/sdc1
4 8 17 3 active sync /dev/sdb1
Output of mdadm --examine /dev/sdb1 (all disks are essentially the same):
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : 5dd490df:79bf70fa:b4b530bc:47b30419
Name : neptune:0 (local to host neptune)
Creation Time : Sat Nov 7 09:48:15 2020
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5860266895 sectors (2.73 TiB 3.00 TB)
Array Size : 8790398976 KiB (8.19 TiB 9.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 264192 sectors
Super Offset : 8 sectors
Unused Space : before=264112 sectors, after=911 sectors
State : clean
Device UUID : a40bb655:70a88240:06dfad1d:f7fcbdca
Internal Bitmap : 8 sectors from superblock
Update Time : Tue Feb 2 06:59:55 2021
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : 42b3d6 - correct
Events : 28109
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)