
I have a box that is periodically "going out to lunch". The symptom: anything requiring actual disk IO hangs for 30+ seconds, while anything already paged in appears unaffected. The problem occurs intermittently (up to a few times an hour) and has not been traceable so far to any running program or user behavior. Re-imaging the box now would be a large disruption, so I hope to isolate the issue and show that isn't necessary. Ubuntu 20.04 system with a btrfs-on-LUKS NVMe root fs.

User descriptions and log analysis (dmesg and journalctl) do not show any behavior correlated with the issue, other than IO-timeout-related messages after tens of seconds that seem clearly to be symptoms/fallout. The box was in use with Ubuntu 20.04 for ~6 months (with no instances of this issue noted) before being re-imaged several months ago, so I have that minor data point suggesting the issue isn't failing hardware. `btrfs scrub` and BIOS SMART don't report any errors.
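For illustration, pulling a journal window around a reported hang looks roughly like this (the timestamps are placeholders for an actual incident time):

```bash
# Hypothetical incident around 14:05 - pull kernel and service messages from a
# window around it, looking for anything that precedes the io-timeout fallout.
journalctl -k --since "2021-01-16 14:00" --until "2021-01-16 14:15" -o short-precise
journalctl --priority=warning --since "2021-01-16 14:00" --until "2021-01-16 14:15"
```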

Using `iotop -o` live during a reproduction, I can see that actual throughput to disk drops to ~zero, with the exception of a couple of kernel threads (`kworker/.*events-power-efficient`).
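Something along these lines should capture the next reproduction with timestamps even if an interactive shell stalls (the log path is arbitrary):

```bash
# Log per-process IO once per second: -b batch mode, -o only processes doing IO,
# -t timestamps, -qqq suppress headers, -d 1 one-second delay. Needs root.
iotop -b -o -t -qqq -d 1 | tee -a /var/log/iotop-trace.log
```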

Please recommend next steps to triage / isolate why the IO is hanging.


SMART output as requested (`#> smartctl -a /dev/nvme0n1`):
[...]
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    4,339,623 [2.22 TB]
Data Units Written:                 7,525,333 [3.85 TB]
Host Read Commands:                 23,147,319
Host Write Commands:                69,696,108
Controller Busy Time:               1,028
Power Cycles:                       98
Power On Hours:                     3,996
Unsafe Shutdowns:                   25
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

Using netdata as recommended by @anx, I am getting close: the following symptom in kernel memory appears 100% correlated with the reproduction in every case I can see. Over the course of ~1 minute, non-reclaimable kernel-allocated memory roughly doubles. While this memory is being allocated, ~all IO is hung. IO unblocks sometime on the down-ramp as the memory is being freed. The central peak/doubling is very consistent; the smaller peaks before and after vary a bit.

[chart: kernel memory usage doubles in a spike, then returns to normal at the same rate]

Further, this behavior happens on an (intermittent) hourly cadence: [chart: the same spike recurring roughly every hour]
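To cross-check the netdata graph against raw kernel counters, a crude sampler along these lines should catch the next spike (the output path is arbitrary):

```bash
# Sample slab / kernel allocation counters from /proc/meminfo every 5 seconds.
while sleep 5; do
  printf '%s ' "$(date +%FT%T)"
  grep -E '^(Slab|SReclaimable|SUnreclaim|KernelStack|PageTables):' /proc/meminfo | tr '\n' ' '
  echo
done >> /tmp/meminfo-trace.log
```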

I'm working on taking an inventory of cron and systemd tasks that run hourly (see the sketch below), and on finding any debug logging that could be enabled, since today there is nothing suspicious at these times in the logs.
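Something like the following (run as root) should cover both systemd timers and system/per-user cron:

```bash
# Enumerate everything that could fire on an hourly cadence.
systemctl list-timers --all            # systemd timers, including inactive ones
ls -l /etc/cron.hourly/                # classic hourly cron scripts
cat /etc/crontab; ls -l /etc/cron.d/   # system crontab and drop-in entries
for u in $(cut -d: -f1 /etc/passwd); do
  crontab -l -u "$u" 2>/dev/null | sed "s/^/$u: /"
done
```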

  • `netdata` looks like a fantastic tool, thanks @anx. I'll work on getting a few reproductions with that; it's a bit of information overload, but it looks to have an extensive config. One thing I don't see is anything for looking at file handle churn - am I missing something there? – some bits flipped Jan 16 '21 at 16:25
  • Answering your specific questions @anx - your point about symptoms being secondary / after the time when the actual root problem occurred is valid; I'll pay attention to that in netdata. On CPU / disk / swap usage - no, I have looked but do not see any CPUs pegged. `swapon -v` shows zero swap usage during this (not surprising, the machine has 66 GB RAM - it's a CV development machine), but there are plenty of reproductions on days/boots when none of that ML work is being done. – some bits flipped Jan 16 '21 at 16:30
  • Are you sure there is no hardware problem? Sometimes this happens. Did you check the NVMe's blocks? Maybe it hangs on some blocks; SMART isn't really helpful sometimes. Can you run `dd if=/dev/zero of=tmpfile` over the free disk space, or read all blocks to /dev/null? – TRW Jan 16 '21 at 16:39
  • @TRW I'm not SURE there is no HW problem - if it weren't for prior problem-free usage on the same HW with very similar SW, that would be a very obvious suspect. It still could be the problem, though. `smartctl` (and `nvme`) show 0% use of spares and no block errors. If other approaches can't narrow in on the problem I can probably take the machine offline and use a `badblocks`-type approach, but I am somewhat banking on my understanding of NVMe SSD wear leveling: it's pretty advanced, and I'd expect problems to show up as remapped bad blocks (?). – some bits flipped Jan 16 '21 at 17:23
  • Hmm, maybe - I would stress the disk, because either the hardware or some software (controller, driver) may have a problem, I guess. I would try it during runtime (because I understood that taking down the machine isn't possible). Otherwise do a hardware check on all levels (RAM, disk), maybe thermal/power, just to have a look. Also - maybe convert it into a VM and try it on other hardware. – TRW Jan 16 '21 at 17:50
  • It's going to take me some time to feel confident with the `netdata` results. When I poke at the system with `fio`, some key params like IO ops/sec and MiB/s look reasonable to me in netdata, but some of the other fields look oddly zero, as if no real data is coming in. I need to configure a `netdata` external server plus a local `ram` (or `save`(?)) config to prevent the local instance from being blocked during a reproduction of this issue. Which, of course, per Murphy, hasn't happened in the last couple of hours I've been looking at it :) – some bits flipped Jan 17 '21 at 23:58
  • @anx - I think that within the scope of the OP, your suggestion of `netdata` is sufficient for an answer, since it has enabled me to identify a specific, repeatable lower-level symptom to investigate; post one? – some bits flipped Jan 22 '21 at 15:15

1 Answer


Patterns (when does the degradation occur?) and correlated metrics (does any other metric sharply drop/spike during the degradation?) are often the quickest path towards identifying the trigger.

Even if the trigger is not responsible for the problem (e.g. if the system stalls on memory pressure, but the mechanism by which it does so is more complicated), it is helpful to have a reliable reproduction method, so you can acquire more data.

Triage steps:

  1. Use a tool like netdata to visually determine patterns and/or correlated metrics. Is anything happening with performance metrics right when the problem occurs? The most helpful ones might appear earlier than the performance degradation - the problem you see could very well be the recovery phase of some driver or program misbehaving.
  2. If you cannot spot the trigger and thus cannot reproduce the problem on purpose, you might still identify a way of determining when it will occur again (e.g. on the next big disk write, every day at 12:00, ...)
  3. Schedule a task that ensures you get a snapshot (or one every X time units) of relevant system state during the event (e.g. `top`, or even `echo t >/proc/sysrq-trigger ; dmesg`); a minimal sketch follows this list. Run the scheduled task manually at least once, so its dependencies are cached.
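A minimal sketch of such a snapshot task (the paths and the one-minute interval are arbitrary placeholders):

```bash
# /usr/local/bin/stall-snapshot.sh - capture process and kernel task state.
cat >/usr/local/bin/stall-snapshot.sh <<'EOF'
#!/bin/sh
ts=$(date +%Y%m%dT%H%M%S)
mkdir -p /var/log/stall
top -b -n 1 > "/var/log/stall/top-$ts.txt"
echo t > /proc/sysrq-trigger                     # dump all task stacks to the kernel log
dmesg -T | tail -n 500 > "/var/log/stall/dmesg-$ts.txt"
EOF
chmod +x /usr/local/bin/stall-snapshot.sh

# Run it every minute via cron, and once by hand so its dependencies are cached.
echo '* * * * * root /usr/local/bin/stall-snapshot.sh' > /etc/cron.d/stall-snapshot
/usr/local/bin/stall-snapshot.sh
```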

Ideas for common explanations:

  1. bad swap configuration or a swap file on some unsuitable block device stack
  2. a bad hard disk, not necessarily the one containing the root filesystem
  3. some debugging task left behind, such as dropping caches in a cronjob
  4. a database which, during housekeeping tasks, consumes an unreasonable amount of memory or accesses a huge directory
  5. SSD trim (discard), or lack thereof combined with huge writes making the system wait for the disk to finish
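A few read-only commands to quickly rule the above in or out (adjust paths to the actual setup):

```bash
swapon --show; cat /proc/swaps                            # 1: where swap lives, if anywhere
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT                        # 2: other disks in the block device stack
grep -r drop_caches /etc/cron* /etc/systemd 2>/dev/null   # 3: leftover cache-dropping jobs
systemctl status fstrim.timer                             # 5: periodic trim scheduled or not
grep discard /etc/fstab /etc/crypttab 2>/dev/null         # 5: continuous discard via mount/crypttab options
```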
anx
  • OP here: for anyone reading this, the key point for me was the tool recommendation - give `netdata` a shot. It's completely free and provides really high-quality dashboards, akin to commercial tools like Datadog (I did not have to configure it; it worked out of the box). Using it I was able to identify a smoking gun to guide next steps. – some bits flipped Jan 22 '21 at 18:51