Troubleshooting frozen disk when queue is full

Question

I have a system (CentOS 7.8) with k8s installed on top.

After a few days of normal operation (system load is ~30% and disk activity is around 60 IOPS, not saturated), the system enters an unstable state where nothing is committed to disk anymore. As the iostat -x 5 output below shows, avgqu-sz freezes and no more reads or writes are accepted.

The question is: where should I look next to identify the root cause?

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,36    0,00    1,71   86,94    0,00    0,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11,41    0,00    1,58   87,01    0,00    0,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00    0,00     0,00     0,00     0,00   169,00    0,00    0,00    0,00   0,00 100,00
dm-0              0,00     0,00    0,00    0,00     0,00     0,00     0,00   186,00    0,00    0,00    0,00   0,00 100,00
dm-1              0,00     0,00    0,00    0,00     0,00     0,00     0,00     7,00    0,00    0,00    0,00   0,00 100,00
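
A stall like the one in the iostat samples above (requests in flight, nothing completing) can also be detected programmatically. The following is a minimal sketch, not a tested monitoring tool: it assumes the standard /proc/diskstats layout (reads completed in column 4, writes completed in column 8, I/Os currently in progress in column 12), and the function names are made up for illustration. A device is reported when it has requests in flight but its completed-I/O count did not change between two samples.

```python
from typing import Dict, List, Tuple


def parse_diskstats(text: str) -> Dict[str, Tuple[int, int]]:
    """Map device name -> (completed I/Os, I/Os currently in flight)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpectedly short lines
        name = fields[2]
        reads_completed = int(fields[3])
        writes_completed = int(fields[7])
        in_flight = int(fields[11])
        stats[name] = (reads_completed + writes_completed, in_flight)
    return stats


def stalled_devices(before: str, after: str) -> List[str]:
    """Devices with queued I/O but no completions between two samples."""
    prev = parse_diskstats(before)
    curr = parse_diskstats(after)
    stalled = []
    for name, (done, in_flight) in curr.items():
        if name in prev and in_flight > 0 and done == prev[name][0]:
            stalled.append(name)
    return sorted(stalled)


def read_diskstats(path: str = "/proc/diskstats") -> str:
    """Read one sample; call twice, a few seconds apart."""
    with open(path) as handle:
        return handle.read()
```

To use it, call read_diskstats() twice with a pause in between (for example 5 seconds, matching the iostat interval) and pass both samples to stalled_devices(). A non-empty result on consecutive checks would correspond to the frozen avgqu-sz pattern shown above.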

The last log lines from /var/log/messages before a forced system reset were:

kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20130517/exfield-389)
kernel: ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMM] (Node ffff99c2ba2513c0), AE_AML_BUFFER_LIMIT (20130517/psparse-536)
kernel: ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20130517/power_meter-339)

According to this Red Hat thread, though, these messages should not be an issue: https://access.redhat.com/discussions/3871951

Later edit 1: Occasionally I get similar freezes for short periods of time (less than a minute), and then the system recovers. In the dmesg output I have:

[Lu aug 17 21:04:07 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Lu aug 17 21:04:15 2020] hpsa 0000:06:00.0: device is ready.
[Lu aug 17 21:04:15 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap+ En+ Exp=1

Later edit 2: I managed to save the dmesg output from a case where the disk doesn't recover anymore and a reset is required.

[Lu aug 24 13:00:18 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Lu aug 24 13:03:20 2020] INFO: task scsi_eh_0:332 blocked for more than 120 seconds.
[Lu aug 24 13:03:20 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Lu aug 24 13:03:20 2020] scsi_eh_0       D ffff8c603fc9acc0     0   332      2 0x00000000
[Lu aug 24 13:03:20 2020] Call Trace:

So hpsa starts this logical volume reset procedure, and it never completes.
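
The difference between the two dmesg excerpts above (a reset that completes versus one that never does) can be checked mechanically. Below is a rough sketch that scans kernel log lines and reports SCSI addresses with a "resetting logical" message that has no matching "reset logical completed successfully" line yet. The regular expressions are derived only from the hpsa messages quoted in this question, so other driver versions may word them differently, and the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Patterns based on the hpsa messages quoted above (an assumption:
# other driver versions may format these lines differently).
RESET_START = re.compile(r"hpsa (\S+): scsi (\S+): resetting logical")
RESET_DONE = re.compile(
    r"hpsa (\S+): scsi (\S+): reset logical\s+completed successfully"
)


def unfinished_resets(lines: Iterable[str]) -> List[str]:
    """Return SCSI addresses whose last reset has not completed."""
    pending = set()
    for line in lines:
        started = RESET_START.search(line)
        if started:
            pending.add(started.group(2))
            continue
        done = RESET_DONE.search(line)
        if done:
            pending.discard(done.group(2))
    return sorted(pending)
```

Feeding it the lines of the kernel log (for example the captured output of dmesg) at a regular interval and alerting when the same address stays in the result for longer than the usual few seconds would give an earlier warning than waiting for the hung task message.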

Ok, I'll try to keep a connection open and get the last logs from dmesg before resetting. Otherwise they don't get persisted. In the meantime I've updated the question with an observation that could correlate. — Laurentiu Soica, Aug 17 '20 at 18:14
@MichaelHampton, I've updated the post with dmesg output when the problem occurs. — Laurentiu Soica, Aug 24 '20 at 10:12

score 2 · Accepted Answer · answered Aug 17 '20 at 18:40

The last time I saw this symptom, with disk I/O stopping or pausing, it turned out to be a bad disk. The controller on the disk itself was probably starting to malfunction, but the platters were OK.

I would make sure you have a good backup and, since the system is on RAID, check whether the SCSI controller is up to date, as it hasn't flagged the disk as bad yet.

yagmoth555


The controller is an HP H240. My version is 6.3. I see the latest is 7.0, so I'll try to update to that. Other than that, the controller and the drives are marked as green. – Laurentiu Soica Aug 17 '20 at 19:16
@LaurentiuSoica Perfect, let me know if that changes anything. In my case the disks were green too, but corruption crept into the RAID because the faulty disk didn't get flagged by the controller. That's why I asked you to make sure you have a backup – yagmoth555 Aug 17 '20 at 19:25
I did upgrade the firmware to the latest available. After another ~3 days, the disk froze again and the only option left was a forced reset. The disks and controller are still reported as green. – Laurentiu Soica Aug 24 '20 at 09:46
So, a few days later, as expected, one of the disks went down. What's still unclear to me is what can be done during this window to troubleshoot and act faster – Laurentiu Soica Aug 28 '20 at 18:43
