I'm seeing poor write-combining (WC) memory read performance with the vmovntdqa non-temporal load instruction on Intel Xeon E-2224 systems, but excellent performance on AMD EPYC 3151 systems. Why such a huge difference, and is there anything I can do about it? On the Intel systems, the instruction does not seem to be working as expected at all.
I have DDR memory on an FPGA board attached to PCI Express. I'm using mmap()
from user space to access the PCI BAR to which said DDR memory is mapped.
The BAR is marked prefetchable, and as expected, Linux provides the _wc
resource files under sysfs accordingly.
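For reference, the prefetchable flag can be confirmed with lspci (the device address here matches the run command further below):
lspci -vv -s 13:00.0 | grep -i prefetchable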
Here are my benchmark results:
System: Dell R240
CPU: Intel Xeon E-2224
Read speed (memcpy()): 13.7 MB/s
Read speed (streaming/NT load): 9.6 MB/s

System: Supermicro M11SDV-4C-LN4F
CPU: AMD EPYC 3151
Read speed (memcpy()): 6.8 MB/s
Read speed (streaming/NT load): 273.6 MB/s
The poor performance also occurs on the Dell R250 (Intel Xeon E-2314) and the Supermicro X11SCA-F (Intel Xeon E-2224G).
The benchmark program (C++, Linux) was:
/* iobench.cpp */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>   /* errno (used by assert_perror()) */
#include <chrono>
#include <cstdlib>  /* aligned_alloc(), free() */
#include <cstring>
#include <iostream>

/* streaming-load-memcpy.cpp (adapted from Mesa, see below) */
void util_streaming_load_memcpy(void* __restrict__ dst,
                                void* __restrict__ src,
                                size_t len);

int main(int argc, char** argv) {
    const size_t sz = 4 * 1024 * 1024; /* 4 MB */
    assert(argc == 2);

    /* Map the PCI BAR via the sysfs resource file given as argv[1]. */
    int sysfs_fd = open(argv[1], O_RDWR | O_CLOEXEC);
    if (sysfs_fd == -1) assert_perror(errno);
    void* src = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_SHARED, sysfs_fd, 0);
    if (src == MAP_FAILED) assert_perror(errno);
    if (close(sysfs_fd) == -1) assert_perror(errno);

    /* Page-aligned destination; the NT loads and aligned stores need
     * at least 16-byte alignment. */
    char* dst = static_cast<char*>(aligned_alloc(4096, sz));
    assert(dst != nullptr);

    /* Time one 4 MB copy from the BAR into ordinary memory. */
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    util_streaming_load_memcpy(dst, src, sz);
    /* Or: memcpy(dst, src, sz); */
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    float duration_sec = std::chrono::duration<float>(end - begin).count();
    float speed = sz / duration_sec / 1024.0 / 1024.0;
    std::cout << speed << " MB/s\n";

    free(dst);
    if (munmap(src, sz) == -1) assert_perror(errno);
}
util_streaming_load_memcpy() was adapted from the Mesa project (minor changes are needed to make it compile standalone).
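For context, the inner loop of the adapted routine boils down to roughly the following (a simplified sketch with a hypothetical name, assuming 16-byte-aligned buffers and a length that is a multiple of 64 bytes; the real Mesa code also handles unaligned heads and tails):

#include <immintrin.h>
#include <cstddef>

/* Sketch only (hypothetical name): four 16-byte non-temporal loads
 * (vmovntdqa) per 64-byte chunk, followed by four ordinary aligned
 * stores. Assumes dst/src are 16-byte aligned and len % 64 == 0. */
void streaming_load_copy_sketch(void* __restrict__ dst,
                                void* __restrict__ src,
                                size_t len) {
    __m128i* out = static_cast<__m128i*>(dst);
    __m128i* in  = static_cast<__m128i*>(src);
    for (size_t i = 0; i < len; i += 64, out += 4, in += 4) {
        __m128i a = _mm_stream_load_si128(in + 0);
        __m128i b = _mm_stream_load_si128(in + 1);
        __m128i c = _mm_stream_load_si128(in + 2);
        __m128i d = _mm_stream_load_si128(in + 3);
        _mm_store_si128(out + 0, a);
        _mm_store_si128(out + 1, b);
        _mm_store_si128(out + 2, c);
        _mm_store_si128(out + 3, d);
    }
}

Each iteration issues four vmovntdqa loads and four vmovdqa stores, which matches the disassembly below.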
- Compile:
g++ -mavx -O2 iobench.cpp streaming-load-memcpy.cpp -o iobench
- Run like:
./iobench /sys/bus/pci/devices/0000\:13\:00.0/resource2_wc
The streaming load loop of util_streaming_load_memcpy() compiled to:
...
1540: c4 e2 79 2a 1e vmovntdqa (%rsi),%xmm3
1545: c4 e2 79 2a 56 10 vmovntdqa 0x10(%rsi),%xmm2
154b: 48 83 c6 40 add $0x40,%rsi
154f: 48 83 c0 40 add $0x40,%rax
1553: c4 e2 79 2a 4e e0 vmovntdqa -0x20(%rsi),%xmm1
1559: c4 e2 79 2a 46 f0 vmovntdqa -0x10(%rsi),%xmm0
155f: c5 f9 7f 58 c0 vmovdqa %xmm3,-0x40(%rax)
1564: c5 f9 7f 50 d0 vmovdqa %xmm2,-0x30(%rax)
1569: c5 f9 7f 48 e0 vmovdqa %xmm1,-0x20(%rax)
156e: c5 f9 7f 40 f0 vmovdqa %xmm0,-0x10(%rax)
1573: 48 39 d6 cmp %rdx,%rsi
1576: 75 c8 jne 1540 <_Z26util_streaming_load_memcpyPvS_m+0x40>
...
I have tried the following, but nothing seems to make any meaningful difference:
- Tweak UEFI BIOS settings
- Run a recent stable kernel (6.2.6)
- Disable speculative execution mitigations (kernel cmdline mitigations=off)
- Update Intel microcode (tests done with microcode 0xf0)
- Use 256-bit NT loads (vmovntdqa with ymm* registers); see the sketch after this list
- Check /sys/kernel/debug/x86/pat_memtype_list; the memory is listed as write-combining
- Use Dell R240 on-board Matrox G200eW3 GPU memory for reading instead of our FPGA DDR
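The 256-bit attempt was along these lines (again a sketch with a hypothetical name, assuming 32-byte alignment; note that vmovntdqa with ymm registers is an AVX2 instruction, so this needs -mavx2 rather than just -mavx):

#include <immintrin.h>
#include <cstddef>

/* Sketch only (hypothetical name): two 32-byte non-temporal loads per
 * 64-byte chunk. Assumes 32-byte alignment and len % 64 == 0. */
void streaming_load_copy_256(void* __restrict__ dst,
                             void* __restrict__ src,
                             size_t len) {
    __m256i* out = static_cast<__m256i*>(dst);
    __m256i* in  = static_cast<__m256i*>(src);
    for (size_t i = 0; i < len; i += 64, out += 2, in += 2) {
        __m256i a = _mm256_stream_load_si256(in + 0);
        __m256i b = _mm256_stream_load_si256(in + 1);
        _mm256_store_si256(out + 0, a);
        _mm256_store_si256(out + 1, b);
    }
}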
Since there have been numerous CPU vulnerabilities, including the "MMIO Stale Data Vulnerabilities", I can't help wondering: is this the result of some mitigation in either the CPU microcode or the hardware design? The microcode cannot easily be downgraded below the version loaded by the firmware.
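For what it's worth, on recent kernels the CPU's status with respect to that vulnerability class is reported under sysfs, which should at least show whether a mitigation is active on the affected Xeons:
cat /sys/devices/system/cpu/vulnerabilities/mmio_stale_data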