3

During mvn compilation, I have random crashes.

The problem seems related to high I/O, and in kern.log I can see things like:

kernel: [158430.895045] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
kernel: [158430.951331] blk_update_request: I/O error, dev nvme0n1, sector 819134096 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
kernel: [158430.995307] nvme nvme1: Removing after probe failure status: -19
kernel: [158431.035065] blk_update_request: I/O error, dev nvme0n1, sector 253382656 op 0x1:(WRITE) flags 0x4000 phys_seg 127 prio class 0
kernel: [158431.035083] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:309: I/O error 10 writing to inode 3933601 (offset 16777216 size 2101248 starting block 31672832)
kernel: [158431.035085] Buffer I/O error on device nvme0n1p1, logical block 31672320
kernel: [158431.035090] ecryptfs_write_inode_size_to_header: Error writing file size to header; rc = [-5]

To replicate the error, I use:

stress-ng --all 8  --timeout 60s --metrics-brief --tz

I've tried some boot options, like adding acpiphp.disable=1 pcie_aspm=off to /etc/default/grub; this seemed to help the stress-ng test, but not my compilation.
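For reference, this is roughly how I'm adding the options (a minimal sketch; the existing GRUB_CMDLINE_LINUX_DEFAULT line will differ per machine, so append to it rather than replacing it):

# Edit /etc/default/grub and append the options to GRUB_CMDLINE_LINUX_DEFAULT, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpiphp.disable=1 pcie_aspm=off"
sudo update-grub      # regenerate /boot/grub/grub.cfg
sudo reboot

# After the reboot, confirm the options actually took effect:
cat /proc/cmdline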

  • Distribution: Ubuntu 19.10
  • Kernel: 5.3.0-45-generic #37-Ubuntu SMP Thu Mar 26 20:41:27 UTC 2020

nvme list shows:

Node             SN                   Model                            Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     28FF72PTFQAS         KXG50ZNV256G NVMe TOSHIBA 256GB          1        256,06  GB / 256,06  GB    512   B +  0 B   AADA4102
/dev/nvme1n1     37DS103NTEQT         THNSN5512GPU7 NVMe TOSHIBA 512GB         1         512,11 GB / 512,11  GB    512   B +  0 B   57DC4102
Brimstedt
  • What's your kernel & distro versions? – NStorm Apr 06 '20 at 07:43
  • @NStorm added dist and kernel – Brimstedt Apr 06 '20 at 08:14
  • Looks like your nvme1 is failing. – danblack Apr 06 '20 at 12:08
  • Do you mean a hardware error? I ran Dell's built-in diagnostics, but it reported no errors. Could it still be a hardware problem? – Brimstedt Apr 07 '20 at 09:24
  • Well, those tools are not always correct; I've seen a Kingston drive that was read-only due to wear while the Kingston tool showed health OK. :) The log shows errors on the drive (nvme0, though, not nvme1 as previously suggested). You can try reading the SMART parameters with `nvme smart-log /dev/nvme0` and see what it shows. – bocian85 Jun 15 '21 at 09:13
  • Also, please say whether those disks are connected directly to the motherboard or via some kind of adapter. I've seen adapters fail and cause problems too. – bocian85 Jun 15 '21 at 09:14

5 Answers

3

I can't tell you exactly where the problem is, as this is just a "generic failure" somewhere in the NVMe subsystem. But I can suggest things you can try to pinpoint it.

  1. Try adding the nvme_core.default_ps_max_latency_us=5500 kernel boot option (see the sketch after this list).
  2. Install the nvme-cli package (or, even better, build the most recent version from source) and check the various logs it exposes, such as smart-log and error-log; that might help diagnose the error further (also sketched below).
  3. Try booting some other distros (live) and stress-test under them to see whether this is kernel-version or distro related. The SystemRescueCd distro might be a good starting point.
  4. If that doesn't help, you can try updating your motherboard firmware (the "BIOS", which with UEFI is not actually a BIOS any more) to the most recent version. This doesn't sound obvious, and the patch notes might not mention anything directly related to the NVMe/PCIe subsystems, but sometimes it helps (practical experience).
  5. Update your NVMe drive firmware. Look for vendor-supplied tools and a manual for this.
  6. If none of the above helps or gives any clues, you might have hit an as-yet-unknown bug or a hardware failure.
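A minimal sketch of steps 1 and 2 on Ubuntu (the /dev/nvme0 and /dev/nvme1 device names are taken from your nvme list output; adjust as needed):

# Step 1: append the option to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
sudo update-grub && sudo reboot

# After rebooting, verify the option is on the command line and picked up by the driver:
cat /proc/cmdline
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Step 2: install nvme-cli and dump the SMART and error logs for both drives:
sudo apt install nvme-cli
sudo nvme smart-log /dev/nvme0    # media errors, temperature, spare capacity, wear
sudo nvme error-log /dev/nvme0    # controller error-log entries
sudo nvme smart-log /dev/nvme1
sudo nvme error-log /dev/nvme1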
NStorm
  • Tried adding max_latency_us (how can I verify that it's enabled?) but it did not help. SMART and error logs show nothing wrong. – Brimstedt Apr 08 '20 at 18:40
2

The line kernel: [158430.895045] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10 means that the NVMe disk controller was not responding and was reset by the NVMe driver to recover communication with the device.

Such issues can be caused by:

  • malfunctioning hardware
  • unstable power (i.e., a bad PSU)
  • too aggressive PCIe Active State Power Management (ASPM)

Putting aside bad hardware, you can try disabling ASPM with the pcie_aspm=off kernel boot option; a quick way to check its effect is sketched below.
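A quick diagnostic sketch to see whether ASPM is actually in play, before and after setting the option (output varies by machine):

# ASPM policy currently in effect system-wide:
cat /sys/module/pcie_aspm/parameters/policy

# ASPM state negotiated on each PCIe link (look for "ASPM Disabled" / "ASPM L1 Enabled"):
sudo lspci -vv | grep -i aspm

# After adding pcie_aspm=off to the kernel command line and rebooting, confirm it:
grep -o 'pcie_aspm=off' /proc/cmdline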

shodanshok
0

I noticed that the errors only occurred on one of the SSDs, the one containing /home.

I moved /home to the other disk in the machine, and so far it seems to be working much better.
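Roughly what the move looked like (a sketch only: the target partition name is just an example for a pre-existing, empty ext4 partition on the other drive, and this is best done with nothing writing to /home):

sudo mount /dev/nvme1n1p2 /mnt       # example target partition on the other disk
sudo rsync -aXH /home/ /mnt/         # copy data, preserving perms, xattrs, hard links

# Point /home at the new partition in /etc/fstab using its UUID, e.g.:
#   UUID=<uuid-from-blkid>  /home  ext4  defaults  0  2
sudo blkid /dev/nvme1n1p2

sudo mv /home /home.old && sudo mkdir /home
sudo mount /home                     # mounts the new fstab entry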

Brimstedt
  • Did you check the temperature of your NVMe? We've had Samsung EVOs running and they performed really badly under high I/O because of overheating. We made a custom cooler for them, which resolved the issue. – Stuka May 10 '20 at 14:07
0

I was having similar issues on my setup and could not find answers anywhere. What eventually turned out to be the culprit was the BIOS power-saving settings.

Like the OP, I was under the impression that the error arose because of high I/O, but it seems it was rather the hardware dropping into a lower power and performance mode after some time.

So if you come across this issue, take a look at your BIOS power settings and turn the knobs; maybe this problem will go away for you too.
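While the actual fix for me was in the BIOS, from the OS side you can at least see which power states the drive advertises and whether autonomous power state transitions (APST) are enabled; a small diagnostic sketch with nvme-cli, using /dev/nvme0 as an example device:

# Power states the controller advertises and whether APST is supported:
sudo nvme id-ctrl /dev/nvme0 | grep -E '^apsta|^ps '

# The APST configuration currently in effect (feature 0x0c):
sudo nvme get-feature /dev/nvme0 -f 0x0c -H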

Good luck :)

Hazle
-2

A quick thing to try is hot-swapping the hard drive driver.

But for high-performance I/O you can't go cheap either. Check the max latency and see how far over it you are going; maybe what you're trying simply demands a better driver with the kernel.

Look for some CMake config or compiler argument to use only one thread or less I/O, to slow it down somehow (a sketch is below). If you can use the terminal to pause the process manually, you might be able to simulate a compile, if you're very desperate.

The only other thing that can be done quickly is to make a VM of that machine, compile in the VM, and debug on the live system.
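Since the question's build uses Maven, a minimal sketch of throttling it (standard Maven and Linux scheduler options; treat the exact command as an example):

# Build with a single Maven thread so modules are not compiled in parallel:
mvn -T 1 clean install

# Or run the whole build at idle I/O and CPU priority to ease the load on the disk:
ionice -c 3 nice -n 19 mvn clean install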