
I have an issue where a mounted external USB hard drive fails. This causes the system's I/O wait to spike (I can see this in Grafana), and shortly afterwards the system becomes completely unresponsive, so I cannot ssh into it to force a umount or to reboot.

I am presently mounting the drive like this:

UUID=UUID_HERE       /data/path/disk1       ext4    user,defaults,nofail,noatime,commit=60,x-systemd.device-timeout=120,errors=continue     0       2 # Added 2023-01-03 09:41:46 +0000

How can I prevent this from happening? Is there an fstab mount option to automatically unmount after a number of errors, or to force a umount after a read/write timeout? Or, if there is a read timeout, is there some way to kill power to the USB ports?

Hackeron
  • Do you see I/O errors logged, or is the problem that performance is horribly slow? – John Mahowald Jan 19 '23 at 22:47
  • I'm not entirely sure. It seems to just be extremely slow, as sometimes I get a Grafana update 20 hours later, but any attempt to ssh times out, so I cannot test. Because these units are installed in physically inaccessible locations, what usually happens is that an engineer's time is scheduled and they just swap the drive and restart the unit a few weeks later. What I am after is some mechanism to automatically eject the malfunctioning drive so that I can at least ssh to the unit to investigate. – Hackeron Jan 20 '23 at 11:00
  • Could you write a local daemon that detects the io wait spike, then umounts the drive? – Rino Bino Feb 09 '23 at 16:38
  • @RinoBino I tried that, but there were too many false positives, and there were also situations where by that point the system was so bogged down that it was too late. I was hoping for something at the kernel level, as I am struggling to find a solution in user space. – Hackeron Feb 11 '23 at 21:07
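
For reference, the user-space watchdog discussed in the comments above could look roughly like the sketch below. It is only an illustration, not a tested fix: the threshold, the sample interval, and the number of consecutive breaches are assumed values, and every function and class name is hypothetical. It samples the aggregate `cpu` line of `/proc/stat`, requires several high-iowait samples in a row before acting (one way to cut down on false positives), and then lazily unmounts the mount point from the fstab line. It does not address the case where the system is already too bogged down for the process to be scheduled.

```python
"""Sketch of an I/O-wait watchdog. All thresholds below are assumptions."""
import subprocess
import time

MOUNT_POINT = "/data/path/disk1"  # mount point from the fstab line in the question
THRESHOLD = 0.80                  # assumed: iowait share of CPU time treated as "stuck"
CONSECUTIVE = 6                   # assumed: breaches in a row required before acting
INTERVAL_S = 10                   # assumed: seconds between samples


def read_cpu_times(path="/proc/stat"):
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    return [int(x) for x in fields[1:]]


def iowait_fraction(prev, curr):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Use only the first eight counters (user, nice, system, idle, iowait,
    # irq, softirq, steal); guest time is already included in user/nice.
    deltas = [c - p for p, c in zip(prev[:8], curr[:8])]
    total = sum(deltas)
    return deltas[4] / total if total > 0 else 0.0


class BreachCounter:
    """Counts consecutive samples above a threshold."""

    def __init__(self, threshold, needed):
        self.threshold = threshold
        self.needed = needed
        self.streak = 0

    def update(self, value):
        """Return True once `needed` samples in a row exceed the threshold."""
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.needed


def lazy_unmount(mount_point):
    """Detach the filesystem now; the kernel cleans up once it is no longer busy."""
    return subprocess.run(["umount", "-l", mount_point], check=False).returncode


def main():
    """Poll until the breach condition is met, then unmount once and exit."""
    counter = BreachCounter(THRESHOLD, CONSECUTIVE)
    prev = read_cpu_times()
    while True:
        time.sleep(INTERVAL_S)
        curr = read_cpu_times()
        if counter.update(iowait_fraction(prev, curr)):
            lazy_unmount(MOUNT_POINT)
            break
        prev = curr
```

To use it, something would have to call `main()` as root, for example from a small service started at boot. Because system-wide iowait cannot tell which device is at fault, a per-device signal would probably be a better trigger if false positives remain a problem.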

0 Answers