
(I reformulated the question; I think it needed to be more structured.)

We have Proxmox VE running on a Dell PowerEdge R610 gen 8 system. The platform is old, but we use it for particular software which is well known to gain nothing from additional modern CPU cores, while its performance scales linearly with CPU clock frequency, and 3.3 GHz accomplishes that goal well. A performance analysis showed that disk I/O is a serious bottleneck, while nothing else is.

HW config is:

  • Dell PowerEdge R610 gen 8, BIOS v6.6.0 of 05/22/2018 (most recent), dual PSU - both seem to be OK. Server boots in UEFI.
  • CPU: 2x Xeon X5680 (Westmere-EP, 12 cores total, 3.33 GHz, turbo up to 3.6 GHz)
  • RAM: 96 GiB - 6x Samsung M393B2K70DM0-YH9 (DDR3, 16GiB, 1333MT/s)
  • Storage controller: LSI MegaRAID SAS 9240-4i, JBOD mode (SAS-MFI BIOS, FW v20.10.1-0107 - not the latest one)
  • Storage: 2x new Samsung SSD 860 EVO 1TB, firmware RVT03B6Q

The MegaRAID we use is not the built-in PERC. The built-in controller is only capable of 1.5 Gbit/s SATA, which is way too slow, and its JBOD/HBA modes are disabled. The add-on 9240-4i, by contrast, runs the SSDs at their maximum interface speed of 6 Gbit/s and allows JBOD mode.

The card has no battery and no cache, so its performance was obviously too low when a RAID was built on it; therefore both disks are configured as JBOD and used with software RAID. The theoretical maximum for a 6 Gbit/s interface is 600 MB/s (6 Gbit/s × 8/10 for the 8b/10b wire encoding = 4.8 Gbit/s = 600 MB/s), which is what to expect from a single-drive sequential test.
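
For completeness, the controller and JBOD state can be double-checked from the host. This is only a sketch: it assumes the storcli64 utility is installed and that the 9240-4i is controller 0 (the index may differ on another system):

# storcli64 /c0 show
# storcli64 /c0/eall/sall show

The first command reports the firmware version and whether JBOD is enabled; the second lists each physical drive with its state (JBOD) and negotiated link speed.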

We did extensive I/O tests both under Linux and under Windows, with fio using the same config in both cases. The only differences in the config were the AIO engine (windowsaio on Windows, libaio on Linux) and the test device specification. The fio config was adapted from this post: https://forum.proxmox.com/threads/pve-6-0-slow-ssd-raid1-performance-in-windows-vm.58559/#post-270657 . I can't show the full fio outputs because they would exceed ServerFault's 30k-character limit; I can share them somewhere else if somebody wants to see them. Here I'll show only the summary lines. Linux (Proxmox VE) was configured with MD RAID1 and "thick" LVM.
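
The full job file is behind the link above; for reference, here is a reconstructed sketch of what the jobs look like. It is not the exact file we ran: the device path is taken from the additional tests further below, and the size is inferred from the roughly 100 GiB per job visible in the summaries:

[global]
; windowsaio was used instead on Windows
ioengine=libaio
direct=1
group_reporting
filename=/dev/vh0/testvol
size=100G

[PVEHost-128K-Q32T1-Seq-Read]
rw=read
bs=128K
iodepth=32
numjobs=1
stonewall

[PVEHost-4K-Q8T8-Rand-Read]
rw=randread
bs=4K
iodepth=8
numjobs=8
stonewall

; ...and analogous sections for the write and 4K-Q1T1 variants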

The write caches inside the SSDs are enabled:

# hdparm -W /dev/sd[ab]

/dev/sda:
 write-caching =  1 (on)

/dev/sdb:
 write-caching =  1 (on)

Devices run at full 6 Gb/s interface speed:

# smartctl -i /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.3.10-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 1TB
Serial Number:    S4FMNE0MBxxxxxx
LU WWN Device Id: x xxxxxx xxxxxxxxx
Firmware Version: RVT03B6Q
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Feb  7 15:25:45 2020 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

# smartctl -i /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.3.10-1-pve] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 1TB
Serial Number:    S4FMNE0MBxxxxxx
LU WWN Device Id: x xxxxxx xxxxxxxxx
Firmware Version: RVT03B6Q
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Feb  7 15:25:47 2020 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Partitions were carefully aligned to 1 MiB, and the "main" large partition, which is the LVM PV and where all tests were done, starts exactly at 512 MiB (a quick way to double-check this is shown after the listings):

# fdisk -l /dev/sd[ab]
Disk /dev/sda: 931,5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: Samsung SSD 860 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 1DDCF7A0-D894-8C43-8975-C609D4C3C742

Device       Start        End    Sectors  Size Type
/dev/sda1     2048     524287     522240  255M EFI System
/dev/sda2   524288     526335       2048    1M BIOS boot
/dev/sda3   526336    1048575     522240  255M Linux RAID
/dev/sda4  1048576 1953525134 1952476559  931G Linux RAID


Disk /dev/sdb: 931,5 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: Samsung SSD 860 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 63217472-3D2E-9444-917C-4776100B2D87

Device       Start        End    Sectors  Size Type
/dev/sdb1     2048     524287     522240  255M EFI System
/dev/sdb2   524288     526335       2048    1M BIOS boot
/dev/sdb3   526336    1048575     522240  255M Linux RAID
/dev/sdb4  1048576 1953525134 1952476559  931G Linux RAID
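
As a quick sanity check of the alignment: partition 4 starts at sector 1048576, and 1048576 sectors × 512 bytes = 512 MiB exactly. parted can verify the same thing (a sketch, with the partition number as in the listings above):

# parted /dev/sda align-check optimal 4
4 aligned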

There is no write-intent bitmap (how it was checked and removed is sketched after the listing):

# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md126 : active raid1 sda4[2] sdb4[0]
      976106176 blocks super 1.2 [2/2] [UU]

md127 : active raid1 sda3[2] sdb3[0]
      261056 blocks super 1.0 [2/2] [UU]

unused devices: <none>
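
For reference, a sketch of how the bitmap can be checked and toggled with mdadm; it was removed for the tests and can be re-added afterwards:

# mdadm --detail /dev/md126 | grep -i bitmap
# mdadm --grow --bitmap=none /dev/md126
# mdadm --grow --bitmap=internal /dev/md126

The first command shows whether a bitmap is configured; the other two remove it and re-add an internal one (while it is off, an unclean shutdown means a full resync).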

LVM was created with a 32 MiB PE size, so everything inside it is aligned to 32 MiB.
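
A sketch of how such a layout is created; the VG/LV names match the test volume used below, and the LV size here is only illustrative:

# pvcreate /dev/md126
# vgcreate -s 32M vh0 /dev/md126
# lvcreate -L 100G -n testvol vh0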

lsblk --discard shows that no device supports any TRIM (not even non-queued). This is probably because the LSI SAS2008 chip does not pass this command through. Queued TRIM is blacklisted on these SSDs anyway: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-core.c?id=9a9324d3969678d44b330e1230ad2c8ae67acf81 . In any case, Windows sees exactly the same, so the comparison is fair.
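
For the record, this is how discard support is checked; all-zero DISC-GRAN and DISC-MAX columns mean TRIM is not usable through this controller (output abridged):

# lsblk --discard /dev/sda
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda         0        0B       0B         0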

The I/O scheduler was set to "none" on both disks. I also tried "mq-deadline" (the default); it showed worse results in general.
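
Checking and switching the scheduler per disk (the "none" and "mq-deadline" values are the ones compared below):

# cat /sys/block/sda/queue/scheduler
[none] mq-deadline kyber bfq
# echo mq-deadline > /sys/block/sda/queue/scheduler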

Under that configuration, fio showed the following results:

PVEHost-128K-Q32T1-Seq-Read  bw=515MiB/s (540MB/s), 515MiB/s-515MiB/s (540MB/s-540MB/s), io=97.5GiB (105GB), run=194047-194047msec 
PVEHost-128K-Q32T1-Seq-Write bw=239MiB/s (250MB/s), 239MiB/s-239MiB/s (250MB/s-250MB/s), io=97.7GiB (105GB), run=419273-419273msec
PVEHost-4K-Q8T8-Rand-Read    bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=799GiB (858GB), run=3089818-3089818msec
PVEHost-4K-Q8T8-Rand-Write   bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=799GiB (858GB), run=6214084-6214084msec
PVEHost-4K-Q32T1-Rand-Read   bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=98.7GiB (106GB), run=380721-380721msec
PVEHost-4K-Q32T1-Rand-Write  bw=132MiB/s (139MB/s), 132MiB/s-132MiB/s (139MB/s-139MB/s), io=99.4GiB (107GB), run=768521-768521msec
PVEHost-4K-Q1T1-Rand-Read    bw=16.8MiB/s (17.6MB/s), 16.8MiB/s-16.8MiB/s (17.6MB/s-17.6MB/s), io=99.9GiB (107GB), run=6102415-6102415msec
PVEHost-4K-Q1T1-Rand-Write   bw=36.4MiB/s (38.1MB/s), 36.4MiB/s-36.4MiB/s (38.1MB/s-38.1MB/s), io=99.8GiB (107GB), run=2811085-2811085msec

On exactly the same hardware configuration, Windows was set up with Logical Disk Manager (dynamic disk) mirroring. The results are:

WS2019-128K-Q32T1-Seq-Read  bw=1009MiB/s (1058MB/s), 1009MiB/s-1009MiB/s (1058MB/s-1058MB/s), io=100GiB (107GB), run=101535-101535msec
WS2019-128K-Q32T1-Seq-Write bw=473MiB/s (496MB/s), 473MiB/s-473MiB/s (496MB/s-496MB/s), io=97.8GiB (105GB), run=211768-211768msec
WS2019-4K-Q8T8-Rand-Read    bw=265MiB/s (278MB/s), 265MiB/s-265MiB/s (278MB/s-278MB/s), io=799GiB (858GB), run=3088236-3088236msec
WS2019-4K-Q8T8-Rand-Write   bw=130MiB/s (137MB/s), 130MiB/s-130MiB/s (137MB/s-137MB/s), io=799GiB (858GB), run=6272968-6272968msec
WS2019-4K-Q32T1-Rand-Read   bw=189MiB/s (198MB/s), 189MiB/s-189MiB/s (198MB/s-198MB/s), io=99.1GiB (106GB), run=536262-536262msec
WS2019-4K-Q32T1-Rand-Write  bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=99.4GiB (107GB), run=823544-823544msec
WS2019-4K-Q1T1-Rand-Read    bw=22.9MiB/s (24.0MB/s), 22.9MiB/s-22.9MiB/s (24.0MB/s-24.0MB/s), io=99.9GiB (107GB), run=4466576-4466576msec
WS2019-4K-Q1T1-Rand-Write   bw=41.4MiB/s (43.4MB/s), 41.4MiB/s-41.4MiB/s (43.4MB/s-43.4MB/s), io=99.8GiB (107GB), run=2466593-2466593msec

The comparison:

Test                 Windows   Linux (none)  Linux (mq-deadline)  Comment
128K-Q32T1-Seq-Read  1058MB/s  540MB/s       539MB/s              ~50% less than Windows, but this is expected
128K-Q32T1-Seq-Write 496MB/s   250MB/s       295MB/s              40-50% less than Windows!
4K-Q8T8-Rand-Read    278MB/s   278MB/s       278MB/s              same as Windows
4K-Q8T8-Rand-Write   137MB/s   138MB/s       127MB/s              almost the same as Windows
4K-Q32T1-Rand-Read   198MB/s   278MB/s       276MB/s              40% more than Windows
4K-Q32T1-Rand-Write  130MB/s   139MB/s       130MB/s              similar to Windows
4K-Q1T1-Rand-Read    24.0MB/s  17.6MB/s      17.3MB/s             26% less than Windows
4K-Q1T1-Rand-Write   43.4MB/s  38.1MB/s      28.3MB/s             12-34% less than Windows

Linux MD RAID1 only reads from both drives when there are at least two reading threads. The first test is single-threaded, so Linux reads from a single drive and achieves single-drive performance. That is justifiable, and the first test result is fine. But the others...
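
One way to confirm this while a test runs is to watch per-member utilization; during a single-threaded sequential read, only one of the two drives is expected to show significant read traffic (which drive it is may vary):

# iostat -x -k 1

and watch the sda/sdb lines.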

These are only host tests. When we ran the same tests inside VMs, the last lines got even worse: a Windows VM under PVE (no ballooning, fixed memory, fixed CPU frequency, virtio-scsi drivers v171, writeback cache with barriers) showed about 70% less than Windows under Hyper-V. Even a Linux VM under PVE shows results much worse than Windows under Hyper-V:

                     windows, windows, linux,
                     hyper-v  pve      pve
128K-Q32T1-Seq-Read  1058MB/s 856MB/s  554MB/s
128K-Q32T1-Seq-Write 461MB/s  375MB/s  514MB/s
4K-Q8T8-Rand-Read    273MB/s  327MB/s  254MB/s
4K-Q8T8-Rand-Write   135MB/s  139MB/s  138MB/s
4K-Q32T1-Rand-Read   220MB/s  198MB/s  210MB/s
4K-Q32T1-Rand-Write  131MB/s  146MB/s  140MB/s
4K-Q1T1-Rand-Read    18.2MB/s 5452kB/s 8701kB/s
4K-Q1T1-Rand-Write   26.7MB/s 7772kB/s 10.7MB/s

During these tests, Windows under Hyper-V stayed quite responsive despite the large I/O load, and so did Linux under PVE. But when Windows ran under PVE, its GUI slowed to a crawl, RDP sessions tended to disconnect due to packet drops, and the load average on the host went up to 48, mostly due to huge I/O wait!

During the tests we saw quite a large load on a single core, which happened to be serving the "megasas" interrupt. This card exposes only a single interrupt source, so there is no way to spread this load "in hardware". Windows didn't show such single-core load during the test, so it seems to use some kind of interrupt steering (spreading the load across cores). Overall CPU load also seemed lower during the Windows host test than during the Linux host test, although this could not be compared directly.
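
Checking which core serves that interrupt, and pinning it manually if desired (the IRQ number 34 below is purely illustrative; with a single vector this only moves the load to another core, it cannot spread it):

# grep -i megasas /proc/interrupts
# echo 2 > /proc/irq/34/smp_affinity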

The question is: why is it so bad, and am I missing something? Is it possible to get performance comparable to that of Windows? (I am writing this with shaking hands and at a loss for words; it is very unpleasant to be playing catch-up with Windows.)


Additional tests, as @shodanshok suggested:

[global]
ioengine=libaio
group_reporting
filename=/dev/vh0/testvol
direct=1
size=5G

[128K-Q1T32-Seq-Read]
rw=read
bs=128K
numjobs=32
stonewall

[128K-Q1T32-Seq-Write]
rw=write
bs=128K
numjobs=32
stonewall

[4K-Q1T32-Seq-Read]
rw=read
bs=4K
numjobs=32
stonewall

[4K-Q1T32-Seq-Write]
rw=write
bs=4K
numjobs=32
stonewall

[128K-Q1T2-Seq-Read]
rw=read
bs=128K
numjobs=2
stonewall

[128K-Q1T2-Seq-Write]
rw=write
bs=128K
numjobs=2
stonewall

The results:

128K-Q1T32-Seq-Read  bw=924MiB/s (969MB/s), 924MiB/s-924MiB/s (969MB/s-969MB/s), io=160GiB (172GB), run=177328-177328msec
128K-Q1T32-Seq-Write bw=441MiB/s (462MB/s), 441MiB/s-441MiB/s (462MB/s-462MB/s), io=160GiB (172GB), run=371784-371784msec
4K-Q1T32-Seq-Read    bw=261MiB/s (274MB/s), 261MiB/s-261MiB/s (274MB/s-274MB/s), io=160GiB (172GB), run=627761-627761msec
4K-Q1T32-Seq-Write   bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=160GiB (172GB), run=1240437-1240437msec
128K-Q1T2-Seq-Read   bw=427MiB/s (448MB/s), 427MiB/s-427MiB/s (448MB/s-448MB/s), io=10.0GiB (10.7GB), run=23969-23969msec
128K-Q1T2-Seq-Write  bw=455MiB/s (477MB/s), 455MiB/s-455MiB/s (477MB/s-477MB/s), io=10.0GiB (10.7GB), run=22498-22498msec

Things are strange: why was 128K-Q1T2-Seq-Read so bad? (The ideal value would be 1200 MB/s, i.e. both drives reading at 600 MB/s in parallel.) Is 5 GiB per job too small for things to settle? Everything else seems to be OK.

Nikita Kipriyanov
  • When you publish fio benchmarks, can you include the job file / command line parameters? Also, the actual fio output can often give clues as to where the time was spent and what was happening (what queue depth was actually achieved? What were the latencies?). Seeing only summary bandwidth numbers makes diagnosis very hard for people on the other side of the internet... – Anon Feb 12 '20 at 07:09
  • There is a link to the PVE forum where you can see the config I used. I'll repeat: this config https://forum.proxmox.com/threads/pve-6-0-slow-ssd-raid1-performance-in-windows-vm.58559/#post-270657 . The only differences are the device spec (filename=) and that we used windowsaio in Windows and libaio in Linux. This is all also mentioned in the question. – Nikita Kipriyanov Feb 12 '20 at 07:13
  • So straight off the bat, the posixaio ioengine is flexible (works with buffered I/O) but is not super performant - https://github.com/axboe/fio/issues/703#issuecomment-428714558 . If you are able to do I/O against the raw disk, I'd recommend at least libaio with direct. Note Windows defaults to using the windowsaio ioengine (which is decent)... – Anon Feb 12 '20 at 07:22
  • ...but you mention you're already using libaio (which I sadly overlooked - this is why it's nice to see the jobfile directly ;-). [Are you also using direct=1 with it](https://serverfault.com/a/918973/203726)? And it would still be helpful to see the full "job finished" output of an fio run... – Anon Feb 12 '20 at 07:24
  • Yes, direct=1 was used in the global section of the config, so it applied to each job. I can't simply insert the job file into the question text because it would hit the limit of 30,000 characters. So only links. Sorry. – Nikita Kipriyanov Feb 12 '20 at 09:43
  • Ah yes. I'd copy the inputs/output to some pastebin and link to it... – Anon Feb 13 '20 at 08:50

1 Answer


It is quite unlikely that you are limited by IRQ service time when using only two SATA disks. Rather, it is very probable that the slow I/O speed you see is the direct result of the MegaRAID controller disabling the disks' own private DRAM caches, which, for SSDs, are critical to good performance.

If you are using a PERC-branded MegaRAID card, you can enable the disk's private cache via omconfig storage vdisk controller=0 vdisk=0 diskcachepolicy=enabled (I wrote that from memory and only as an example; please check the omconfig CLI reference for the exact syntax).
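
For a non-PERC MegaRAID card like the one used here, the equivalent setting is usually reachable via storcli or MegaCli instead; again a sketch from memory, to be checked against the tool's own documentation, and note that these act on virtual drives, while drives exposed as JBOD report their cache state directly via hdparm -W as shown in the question:

# storcli64 /c0/vall set pdcache=on
# MegaCli64 -LDSetProp EnDskCache -LAll -aAll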

Anyway, be sure to understand what this means: if the disk cache is enabled when using consumer (i.e. non-power-protected) SSDs, any power outage can lead to data loss. If you host critical data, do not enable the disk cache; rather, buy enterprise-grade SSDs which come with a power-loss-protected writeback cache (e.g. Intel S4510).

If, and only if, your data are expendable, then feel free to enable the disk's internal cache.

Some more reference: https://notesbytom.wordpress.com/2016/10/21/dell-perc-megaraid-disk-cache-policy/

shodanshok
  • I updated the question. In general: the MegaRAID is not a PERC but a different card, it is in JBOD mode and the SSDs are JBODs, so I see them as `sda` and `sdb` directly. On top there is MD RAID. And still the whole stack is considerably slower than a similar Windows config with LDM mirroring, in the SAME MegaRAID configuration (i.e. if it disables caches, it disables them in Windows too). – Nikita Kipriyanov Feb 07 '20 at 08:38
  • For reference: the load average was on the order of 48 during some tests. %si showed up to 50% on that core, and the core also took some of the other load. This is why I think performance could have been hindered by interrupt handling. – Nikita Kipriyanov Feb 07 '20 at 08:46
  • So, can you post the output of `hdparm -W /dev/disk` and `smartctl --all /dev/disk` (replacing `disk` with both `sda` and `sdb`)? – shodanshok Feb 07 '20 at 10:15
  • I updated the question again. According to that, drive-side caching is enabled. The single-thread, no-queue (queue=1, jobnum=1) fio test (mq-deadline on both disks) shows performance around 34% lower than in Windows on the same HW, which points at latency problems. – Nikita Kipriyanov Feb 10 '20 at 08:04
  • 1
    Windows can be slightly faster due to not having a RAID1 bitmap - while `mdadm`, by default (and for good reasons) create it. That said, if no valuable data are on the linux RAID array, can you show the output of `fio --name=test --filename=/dev/ --rw=randwrite --size=1G` (**warning: this will destroy data - don't run it if you have valuable data on the array**)? Please also post the output of `cat /proc/mdstat` and `fdisk -l /dev/sd[ab]` – shodanshok Feb 10 '20 at 14:23
  • I reformulated the question because I found it needed a clean-up. Sorry if that was rude or inconvenient. I included the fio tests we did both in Windows and in Linux, side by side. If you need detailed fio output, please ask, as it doesn't fit here. – Nikita Kipriyanov Feb 11 '20 at 10:20
  • Your `md` array probably has an *internal* bitmap - can you show the output of `mdadm -D /dev/md126`? If no bitmap is present, be sure to grow your array to include one; otherwise, an unclean shutdown can mean a complete array resync. That said, your `fio` benchmark seems in line with what to expect from a Samsung 860 EVO. Anyway, can you try with the `noop` scheduler? Also, please do another `fio` run including the `--direct=1` parameter. – shodanshok Feb 11 '20 at 17:02
  • I know what a write-intent bitmap is. It was there, but I disabled it, at least for the test; performance is better that way. I know the consequences. Anyway, the server has 2 PSUs backed by different UPSes, it monitors battery charge, and it will shut down when they discharge. The fio benchmarks I presented already have the direct=1 parameter set in every test (it is defined in the global section of the config). Also, PVE doesn't have a `noop` scheduler available: `cat /sys/block/sda/queue/scheduler` => `[none] mq-deadline kyber bfq`, but the `none` I used is the multiqueue noop. – Nikita Kipriyanov Feb 12 '20 at 06:01
  • If that was "normal performance to expect from 860 EVO", why Windows achieved much better result in the "seq write" (2nd) test and "no queue single thread" (last two) tests? I also updated with tests from inside VMs. I understand this could also come drivers and Qemu, but still they uncover the performance problems with last two tests, which, I think, could come from too large request latencies. Is it possible to do something with this? – Nikita Kipriyanov Feb 12 '20 at 06:05
  • My bad, I was thinking about 4K random write, while you were speaking about 128K sequential write. In order to exclude any problem related to libaio (used to generate single-thread multi-queue requests), can you use `--numjobs=32` rather than `--iodepth=32`? Does it change anything? That said, please be aware that `noop` is not the same as `none`: the former does some limited (but useful) form of merging, while the latter does no merging at all. The strange thing is that with `--direct=1` the I/O scheduler should not matter, yet you see some significant variations. – shodanshok Feb 12 '20 at 08:27
  • I've updated the question with the additional tests you suggested, at the bottom. BTW, are you sure about the influence of `direct=1` on the scheduler? AFAIK, `O_DIRECT` is a caching control, not a scheduler control (see https://stackoverflow.com/questions/41275161/why-writes-with-o-direct-and-o-sync-still-causing-io-merge ); it shouldn't affect scheduler operation. Also, `man open` says nothing about scheduler effects. – Nikita Kipriyanov Feb 13 '20 at 08:20
  • From your `--numjobs=32` tests, it appears that the sequential read/write results almost doubled compared to a single job with QD 32 I/O. The bottom line is that the Linux libaio implementation is not always faster than "true" multi-process read/write. Regarding `O_DIRECT`: using this flag bypasses the system cache and queue, immediately dispatching the I/O request to the underlying block device. It is the responsibility of the hardware device to coalesce multiple requests. You can check this yourself by running an `O_DIRECT` workload while using `iostat -x -k 1` to monitor the `rrqm/s` and `wrqm/s` fields. – shodanshok Feb 13 '20 at 08:54
  • I am still confused. Why was the 2-thread 128K read test so slow then? 128K-Q1T2-Seq-Read bw=427MiB/s – Nikita Kipriyanov Feb 13 '20 at 09:01
  • Because SSDs need a good amount of parallelism to extract maximum performance. The Windows AIO implementation seems somewhat better, in a contrived use case like your scenario, than Linux `libaio`. However, this is an entirely different matter, probably not well suited to the ServerFault comment system. – shodanshok Feb 13 '20 at 10:43