
I recently set up a RAID-5 array with 4 disks on Ubuntu 22.04. When I try to mount the array using

sudo mount /dev/md0 /mnt/md0

the entire system freezes. I can still see the desktop but the mouse doesn't respond, Ctrl+F1 doesn't work, the machine appears to drop off the network, and even Alt+SysRq REISUB appears to do nothing!
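
One thing I still want to rule out is the magic SysRq key simply being disabled; as far as I understand it can be checked like this (Ubuntu 22.04 defaults kernel.sysrq to 176, which should still allow the final sync/remount/reboot steps of REISUB):

cat /proc/sys/kernel/sysrq                # current SysRq bitmask (176 is the Ubuntu default)
echo 1 | sudo tee /proc/sys/kernel/sysrq  # enable all SysRq functions until the next reboot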

I have checked /var/log/syslog and there are no errors or anything else that would indicate the cause of the crash.

cat /proc/mdstat gives the following output:

Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10] 
md127 : active raid1 sdf1[0] sdf2[1]
      1999808 blocks [2/2] [UU]
      
md0 : active raid5 sde[2] sdc[3] sdd[1] sda[0]
      11720659392 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/30 pages [0KB], 65536KB chunk

unused devices: <none>

sudo /usr/share/mdadm/checkarray -a /dev/md0 gets to about 0.3% and then the system freezes in the same way.

For context, I originally had a RAID-1 array with 2 disks and converted it to a RAID-5 array by adding 2 more disks; all four drives are 4 TB. I did manage to mount the new array at one point, but the problems began soon after the conversion. I think it may have been while I was growing the array from 4 TB to 12 TB, although I'm not sure. I have also tried reinstalling Ubuntu, but the problem persists.
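
For reference, the conversion was done roughly like this, reconstructed from memory (the device names may not be exact, and the final step assumes an ext4 filesystem):

sudo mdadm --grow /dev/md0 --level=5           # convert the 2-disk RAID-1 in place to RAID-5
sudo mdadm --add /dev/md0 /dev/sdc /dev/sde    # add the two new 4 TB disks
sudo mdadm --grow /dev/md0 --raid-devices=4    # reshape across all 4 disks (long-running)
sudo resize2fs /dev/md0                        # grow the filesystem from ~4 TB to ~12 TB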

Last part of /var/log/syslog before the freeze:

Aug 16 10:58:16 dolly systemd[1]: Reloading.
Aug 16 10:58:16 dolly systemd[1]: Mounting Mount unit for snapd, revision 16292...
Aug 16 10:58:16 dolly kernel: [  326.362036] loop8: detected capacity change from 0 to 96176
Aug 16 10:58:16 dolly systemd[1]: Mounted Mount unit for snapd, revision 16292.
Aug 16 10:58:17 dolly kernel: [  327.149055] audit: type=1400 audit(1660643897.246:60): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/16292/usr/lib/snapd/snap-confine" pid=3145 comm="apparmor_parser"
Aug 16 10:58:17 dolly kernel: [  327.150068] audit: type=1400 audit(1660643897.246:61): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/16292/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=3145 comm="apparmor_parser"
Aug 16 10:58:18 dolly kernel: [  328.322198] audit: type=1400 audit(1660643898.418:62): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap-update-ns.firefox" pid=3147 comm="apparmor_parser"
Aug 16 10:58:18 dolly kernel: [  328.394564] audit: type=1400 audit(1660643898.490:63): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap-update-ns.snap-store" pid=3148 comm="apparmor_parser"
Aug 16 10:58:19 dolly kernel: [  329.413958] audit: type=1400 audit(1660643899.510:64): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap-update-ns.snapd-desktop-integration" pid=3149 comm="apparmor_parser"
Aug 16 10:58:19 dolly kernel: [  329.429861] audit: type=1400 audit(1660643899.526:65): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.firefox.hook.configure" pid=3160 comm="apparmor_parser"
Aug 16 10:58:19 dolly kernel: [  329.440241] audit: type=1400 audit(1660643899.538:66): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.snap-store.hook.configure" pid=3161 comm="apparmor_parser"
Aug 16 10:58:19 dolly kernel: [  329.441872] audit: type=1400 audit(1660643899.538:67): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.firefox.firefox" pid=3159 comm="apparmor_parser"
Aug 16 10:58:19 dolly kernel: [  329.459354] audit: type=1400 audit(1660643899.558:68): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.snap-store.snap-store" pid=3162 comm="apparmor_parser"
Aug 16 10:58:19 dolly kernel: [  329.460914] audit: type=1400 audit(1660643899.558:69): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.snap-store.ubuntu-software" pid=3163 comm="apparmor_parser"
Aug 16 10:58:19 dolly snapd[740]: daemon.go:521: gracefully waiting for running hooks
Aug 16 10:58:19 dolly snapd[740]: daemon.go:523: done waiting for running hooks
Aug 16 10:58:20 dolly snapd[740]: overlord.go:504: Released state lock file
Aug 16 10:58:20 dolly systemd[1]: snapd.service: Deactivated successfully.
Aug 16 10:58:20 dolly systemd[1]: snapd.service: Consumed 9.814s CPU time.
Aug 16 10:58:20 dolly systemd[1]: snapd.service: Scheduled restart job, restart counter is at 1.
Aug 16 10:58:20 dolly systemd[1]: Stopped Snap Daemon.
Aug 16 10:58:20 dolly systemd[1]: snapd.service: Consumed 9.814s CPU time.
Aug 16 10:58:20 dolly systemd[1]: Starting Snap Daemon...
Aug 16 10:58:20 dolly snapd[3170]: AppArmor status: apparmor is enabled and all features are available
Aug 16 10:58:20 dolly snapd[3170]: overlord.go:263: Acquiring state lock file
Aug 16 10:58:20 dolly snapd[3170]: overlord.go:268: Acquired state lock file
Aug 16 10:58:20 dolly snapd[3170]: daemon.go:247: started snapd/2.56.2+22.04ubuntu1 (series 16; classic) ubuntu/22.04 (amd64) linux/5.15.0-46-generic.
Aug 16 10:58:20 dolly kernel: [  330.352142] loop9: detected capacity change from 0 to 8
Aug 16 10:58:20 dolly systemd[1]: tmp-syscheck\x2dmountpoint\x2d2185967355.mount: Deactivated successfully.
Aug 16 10:58:20 dolly snapd[3170]: daemon.go:340: adjusting startup timeout by 1m10s (pessimistic estimate of 30s plus 5s per snap)
Aug 16 10:58:20 dolly systemd[1]: Started Snap Daemon.
  • This dmesg before the freeze is absolutely useless. You can try to get (some of) the dmesg after the freeze by configuring [`netconsole`](https://www.kernel.org/doc/Documentation/networking/netconsole.txt) to some neighbor machine and recording the logs there; there is hope that it succeeds in sending a message before it freezes. // I suspect you have a somewhat faulty drive. Check SMART on all drives, try to read from each drive individually with `dd if=/dev/sdX of=/dev/zero`, examine each drive's metadata with `mdadm --examine /dev/sdXY`, and so on. Of course, do it with working netconsole. – Nikita Kipriyanov Aug 16 '22 at 13:23
  • Also a "good practice" remark: avoid RAID5, especially on large drives. (Anything bigger than those old 146 GB SAS rust is "large" for this matter.) Yes, by using RAID10 or RAID6 you lose some more capacity, but with RAID5 you're likely to eventually lose your data, and no, its "redundancy" won't help you. Just google it; there are plenty of sad stories about it. – Nikita Kipriyanov Aug 16 '22 at 13:29
  • Thanks for the reply. I checked SMART on each drive and all appeared healthy. I'll try the other tips though, thanks! – oh_cripes Aug 17 '22 at 16:48
  • @NikitaKipriyanov The restriction on RAID5 for large volumes applies less to SSDs than it does to spinning rust. An example article is [here](https://www.kjctech.net/why-raid-5-is-ok-on-ssd-drives/). Rebuild times are much shorter on SSDs and UREs are less of a problem. Major vendors (Dell, HPE) are fine with RAID5 on SSDs. – doneal24 Aug 17 '22 at 19:59
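
A minimal sketch of the checks suggested in the comments, assuming netconsole streams kernel messages to a second machine on the LAN (the IP addresses, interface name, and MAC address below are placeholders):

# on the freezing machine: send kernel messages over UDP to a neighbor
sudo modprobe netconsole netconsole=6666@192.168.1.10/enp3s0,6666@192.168.1.50/aa:bb:cc:dd:ee:ff

# on the neighbor machine: record whatever arrives (netcat-openbsd syntax)
nc -u -l 6666 | tee freeze-kernel.log

# then, per member drive of md0 (sda, sdc, sdd, sde here):
sudo smartctl -a /dev/sda                               # SMART health and error counters
sudo dd if=/dev/sda of=/dev/null bs=1M status=progress  # full sequential read test, output discarded
sudo mdadm --examine /dev/sda                           # md superblock / metadata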

0 Answers