nvme device dropouts - I/O 0 QID 0 timeout, controller disabled

Question

We have 6 Supermicro servers, all of the same (or very similar) spec. Over the last two weeks, one of them has been dropping an NVMe disk at random times due to:

[ 66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
[ 66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
[ 66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5

We have tried:

Swapping the disk
Swapping the NVMe cables
Swapping the NVMe controller (motherboard)
Swapping the backplane
Downgrading from Kernel 4.5.0 to 4.4.2 given recent changes to the storage subsystem
Upgrading disk and motherboard firmwares
Swapping the motherboard

So it's essentially a whole new server, except that we haven't done a reinstall. Why? Because I want to understand the problem, and if reinstalling fixes it, we'll never know why it's happening on this machine and not on our other 5.

No SMART or nvme-cli errors are reported on the drive when it is functioning.
If the drive is swapped into another bay it works fine, and whatever drive is placed into the original bay eventually times out / fails.
CentOS 7 (Latest patches installed)
Kernel 4.5.0
2x Intel DC3600 NVMe (2.5" FF)
Intel Corporation C610/X99 series chipset
Full lspci -tvv output: https://gist.github.com/sammcj/8839c536b2cf6d4def8d2572eb1b4e8a
Full kernel config: https://gist.github.com/sammcj/7d1e79775bf984424b92679d16c015c6
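
Since the failure appears to follow the bay rather than the drive, one way to check that against saved logs is to count timeout lines per PCI address. The following is only a minimal sketch: it assumes the log lines look exactly like the ones quoted above, and the function name `count_timeouts_by_slot` and the `SAMPLE` text are made up for illustration.

```python
import re
from collections import Counter

# Matches timeout lines in the format quoted in the question, e.g.
# "nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller"
TIMEOUT_RE = re.compile(
    r"nvme (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout"
)

# Illustrative input only: the first two lines are copied from the question.
SAMPLE = (
    "[ 66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller\n"
    "[ 66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)\n"
)


def count_timeouts_by_slot(log_text):
    """Return a Counter mapping PCI address -> number of timeout lines."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        counts[match.group("pci")] += 1
    return counts


if __name__ == "__main__":
    for pci, n in count_timeouts_by_slot(SAMPLE).items():
        print(pci, n)
```

Feeding it the concatenated dmesg output from several boots would show whether every timeout lands on the same address regardless of which drive was installed at the time.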

Did you get this resolved in the end? What was the issue? – Baruch Even, Nov 19 '17 at 12:38

score 1 · Answer 1 · answered Nov 21 '18 at 10:29

I've had a similar failure with Intel P4600 drives (different from yours). The ruling from Intel for our case was a rare firmware bug, with the action items being to replace the specific drives and update the firmware to the latest on all remaining drives. YMMV.

The error you are getting means that the drive is present at the PCIe level and can even be communicated with at some basic NVMe level, but it cannot complete full initialization due to an internal assert on the drive (again, this is based on the FA results for our drives; it may differ for you).
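
As a rough aid for reading those messages, the negative numbers can be looked up as errno names. This is only a sketch, and it assumes the codes in the quoted lines are standard Linux errno values (the kernel conventionally returns them negated); the helper name `decode_status` is hypothetical.

```python
import errno
import re

# Captures the message and the trailing negative code from lines such as
# "nvme 0000:03:00.0: Identify Controller failed (-4)" or
# "nvme 0000:03:00.0: Removing after probe failure status: -5".
STATUS_RE = re.compile(r"nvme \S+: (?P<msg>.+?)\s*\(?(?P<code>-\d+)\)?$")


def decode_status(line):
    """Return (message, code, errno_name), or None if no negative code is found.

    Assumption: the code is a negated Linux errno value.
    """
    match = STATUS_RE.search(line.strip())
    if match is None:
        return None
    code = int(match.group("code"))
    name = errno.errorcode.get(-code, "unknown")
    return match.group("msg"), code, name


if __name__ == "__main__":
    lines = [
        "[ 66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)",
        "[ 66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5",
    ]
    for line in lines:
        print(decode_status(line))
```

Under that assumption, -4 corresponds to EINTR and -5 to EIO, which fits a probe that was interrupted by the timeout and then abandoned with a generic I/O error.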

score 0 · Answer 2 · answered Apr 14 '16 at 07:33


Call Supermicro support or use a completely different server.

You've done more troubleshooting than most would and have definitely followed all of the reasonable steps within your control.

Supermicro equipment is relatively cheap and doesn't provide the same level of polish that a Dell or HP system would have. So take it from someone who's seen large Supermicro deployments at scale... You may just have a dud.

answered Apr 14 '16 at 07:33

ewwhite


Hi, thank you for your input. However, I don't think there is ever such a thing as 'just a dud', and I'd rather understand the problem than swap it for something else. Chances are it's not even Supermicro's fault; in fact, it looks to be a kernel bug introduced back in 4.4 - http://lists.infradead.org/pipermail/linux-nvme/2016-April/004373.html – s_mcleod Apr 16 '16 at 09:54
Then why is this only impacting _one_ of your servers? Are they not configured identically? – ewwhite Apr 16 '16 at 11:21

Years later, and I run into this issue myself... Sorry :( – ewwhite Oct 16 '21 at 02:49
