I have some strange IO activity on a server and I can't figure out where it's coming from.

To provide some background, I had to replace an NVMe (Samsung PM81) in a server due to wear. I hadn't noticed any performance issues, but SMART reported it was time to get a replacement. I did notice some unusual IO activity on the device, but I assumed it was a side effect of the wear and didn't think much of it.

Now, with a brand new NVMe (Samsung 980 Pro) and an OS installed from scratch (Debian 10), the IO activity issue still persists.

Here are the contents of /proc/diskstats over a period of 1 minute:

$ cat /proc/diskstats; sleep 1m; cat /proc/diskstats 
 259       0 nvme0n1 2323590 271 213032732 285413 43708052 69809516 16770577066 269903507 0 901057472 1159862364 0 0 0 0
 259       1 nvme0n1p1 2006 0 7264 3665 2 0 2 0 0 44 3080 0 0 0 0
 259       2 nvme0n1p2 74879 0 5283682 9424 2001773 386508 28620456 971285 0 455348 825152 0 0 0 0
 259       3 nvme0n1p3 2246597 271 207737634 272318 40382341 69423008 16741956608 266611966 0 12038708 266043996 0 0 0 0
 259       0 nvme0n1 2323590 271 213032732 285413 43710868 69817259 16771166530 269907653 0 901114568 1159920624 0 0 0 0
 259       1 nvme0n1p1 2006 0 7264 3665 2 0 2 0 0 44 3080 0 0 0 0
 259       2 nvme0n1p2 74879 0 5283682 9424 2002019 386548 28623272 971330 0 455376 825180 0 0 0 0
 259       3 nvme0n1p3 2246597 271 207737634 272318 40384852 69430711 16742543256 266615967 0 12041324 266047732 0 0 0 0

As you can see, it reports nvme0n1 spending over 95 % of the time doing IO ((901114568-901057472)/60000*100), but the IO usage on the partitions is next to nothing. Where is the IO being done, then? On the partition table? Also, the time spent reading (0 ms) plus the time spent writing (4146 ms) does not add up to the time spent doing I/O (57096 ms). What else is there to do but read and write?
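
For anyone who wants to repeat the arithmetic above without doing it by hand, here is a minimal Python sketch that diffs two /proc/diskstats snapshots. The names `FIELDS`, `parse_diskstats` and `busy_report` are mine, not part of any tool, and the column order is my reading of the kernel's iostats documentation (the first 11 counters after the device name). The trailing discard columns are ignored.

```python
# Names for the first 11 counters after the device name. The order is an
# assumption based on the kernel's iostats documentation.
FIELDS = (
    "reads", "reads_merged", "sectors_read", "ms_reading",
    "writes", "writes_merged", "sectors_written", "ms_writing",
    "in_flight", "ms_doing_io", "weighted_ms",
)


def parse_diskstats(text):
    """Map device name -> dict of counters for one /proc/diskstats snapshot."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 + len(FIELDS):
            continue  # skip blank or truncated lines
        values = map(int, parts[3:3 + len(FIELDS)])
        stats[parts[2]] = dict(zip(FIELDS, values))
    return stats


def busy_report(before, after, interval_ms):
    """Per-device counter deltas plus utilisation over the interval."""
    report = {}
    for name, new in after.items():
        if name not in before:
            continue
        old = before[name]
        # in_flight is a gauge, not a cumulative counter, so it is not diffed.
        delta = {k: new[k] - old[k] for k in FIELDS if k != "in_flight"}
        delta["util_pct"] = 100.0 * delta["ms_doing_io"] / interval_ms
        report[name] = delta
    return report


# The two snapshots from the question, taken 60 seconds apart.
BEFORE = """\
 259       0 nvme0n1 2323590 271 213032732 285413 43708052 69809516 16770577066 269903507 0 901057472 1159862364 0 0 0 0
 259       1 nvme0n1p1 2006 0 7264 3665 2 0 2 0 0 44 3080 0 0 0 0
 259       2 nvme0n1p2 74879 0 5283682 9424 2001773 386508 28620456 971285 0 455348 825152 0 0 0 0
 259       3 nvme0n1p3 2246597 271 207737634 272318 40382341 69423008 16741956608 266611966 0 12038708 266043996 0 0 0 0
"""
AFTER = """\
 259       0 nvme0n1 2323590 271 213032732 285413 43710868 69817259 16771166530 269907653 0 901114568 1159920624 0 0 0 0
 259       1 nvme0n1p1 2006 0 7264 3665 2 0 2 0 0 44 3080 0 0 0 0
 259       2 nvme0n1p2 74879 0 5283682 9424 2002019 386548 28623272 971330 0 455376 825180 0 0 0 0
 259       3 nvme0n1p3 2246597 271 207737634 272318 40384852 69430711 16742543256 266615967 0 12041324 266047732 0 0 0 0
"""

REPORT = busy_report(parse_diskstats(BEFORE), parse_diskstats(AFTER), 60_000)

for name, d in REPORT.items():
    print(f"{name:10s} read {d['ms_reading']:6d} ms  "
          f"write {d['ms_writing']:6d} ms  "
          f"busy {d['ms_doing_io']:6d} ms  util {d['util_pct']:6.2f} %")
```

On these snapshots it gives 57096 ms busy (95.16 %) for nvme0n1 against 2616 ms (4.36 %) for nvme0n1p3, which is exactly the mismatch I am asking about.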

There aren't any more partitions or unallocated space on the device:

$ echo p | sudo fdisk /dev/nvme0n1

Welcome to fdisk (util-linux 2.33.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: Samsung SSD 980 PRO 2TB                 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0B698EB9-DD2E-4131-9730-4193DD9D5FB5

Device             Start        End    Sectors  Size Type
/dev/nvme0n1p1      2048    1953791    1951744  953M EFI System
/dev/nvme0n1p2   1953792  197265407  195311616 93.1G Linux filesystem
/dev/nvme0n1p3 197265408 3907028991 3709763584  1.7T Linux filesystem

Command (m for help): 

SMART also reports an error, but if I understand it correctly it is simply reporting a missing feature on the device, and not a functional issue:

$ sudo smartctl -a /dev/nvme0n1
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-21-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 PRO 2TB
Serial Number:                      S69ENL0T610188X
Firmware Version:                   5B2QGXA7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,736,883,855,360 [1.73 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 b621a0ae58
Local Time is:                      Tue Sep 27 10:47:54 2022 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    214,200 [109 GB]
Data Units Written:                 16,891,230 [8.64 TB]
Host Read Commands:                 2,350,427
Host Write Commands:                42,643,472
Controller Busy Time:               238
Power Cycles:                       1
Power On Hours:                     262
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               55 Celsius

Read Error Information Log failed: NVMe Status 0x02

I also checked iotop, but I couldn't see anything relevant:

$ sudo iotop -aoPb -n 2 -d 60
unable to set locale, falling back to the default locale
Total DISK READ:         0.00 B/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
Total DISK READ:         0.00 B/s | Total DISK WRITE:        36.82 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      46.88 K/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
  649 be/3 root          0.00 B     84.00 K  0.00 %  0.07 % [jbd2/nvme0n1p3-]
  396 be/3 root          0.00 B     40.00 K  0.00 %  0.05 % [jbd2/nvme0n1p2-]
31590 be/4 root          0.00 B      0.00 B  0.00 %  0.00 % [kworker/u48:0-flush-259:0]
 4761 be/4 root          0.00 B      2.02 M  0.00 %  0.00 % minio server /data
  733 be/4 root          0.00 B     12.00 K  0.00 %  0.00 % dcgm-exporter
  737 be/4 root          0.00 B      8.00 K  0.00 %  0.00 % nscd

I guess this means the IO is being performed by the kernel itself?

Can anybody help me figure out what is causing this IO activity and how to avoid it? I'd rather this NVMe didn't wear out early and need replacing again.

1 Answer

Finally solved the mystery!

It appears there is a bug in the Linux kernel, at least in the 4.19 kernel that Debian 10 ships, that makes /proc/diskstats report wrong metrics for some storage devices.

I upgraded the kernel to 5.10.0 (the one available on buster-backports) and the metrics are now correct.
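
If you want to flag machines that might still show the bogus numbers, a rough check along these lines could help. It is only a sketch: `kernel_version`, `may_misreport` and the `KNOWN_GOOD` threshold are names I made up, and the threshold is simply the version that worked for me. I don't know which release actually fixed the accounting, so kernels between 4.19 and 5.10 may or may not be affected.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string such as '4.19.0-21-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


# Assumption: 5.10 is the version observed to report correct metrics here.
# The exact release that fixed the problem is not known.
KNOWN_GOOD = (5, 10)


def may_misreport(release):
    """True if the kernel is older than the version observed to be correct."""
    return kernel_version(release) < KNOWN_GOOD


if __name__ == "__main__":
    release = platform.release()
    print(f"{release}: diskstats may be unreliable = {may_misreport(release)}")
```

A `True` result only means the kernel is older than the one I verified, not that the device is definitely affected.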

The issue can be replicated on AWS using t2.micro and t3.micro instances with Debian 10, in case someone is interested.