I'm seeing poor write-combining (WC) memory read performance with the vmovntdqa non-temporal load instruction on Intel Xeon E-2224 systems, but excellent performance on AMD EPYC 3151 systems. Why such a huge difference, and is there anything I can do about it? On the Intel systems, the instruction does not seem to be working as expected at all.
I have DDR memory on an FPGA board attached to PCI Express. I'm using mmap()
from user space to access the PCI BAR to which said DDR memory is mapped.
The BAR is marked prefetchable, and as expected, Linux provides the _wc
resource files under sysfs accordingly.
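For reference, the prefetchable flag can be confirmed with lspci (the device address here matches the run command further below):
lspci -vv -s 13:00.0 | grep -i prefetchable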
Here are my benchmark results:
System: Dell R240
CPU: Intel Xeon E-2224
Read speed (memcpy()): 13.7 MB/s
Read speed (streaming/NT load): 9.6 MB/s

System: Supermicro M11SDV-4C-LN4F
CPU: AMD EPYC 3151
Read speed (memcpy()): 6.8 MB/s
Read speed (streaming/NT load): 273.6 MB/s
The poor performance also occurs on the Dell R250 (Intel Xeon E-2314) and the Supermicro X11SCA-F (Intel Xeon E-2224G).
The benchmark program (C++, Linux) was:
/* iobench.cpp */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>   /* errno (used by assert_perror()) */
#include <chrono>
#include <cstdlib>  /* aligned_alloc(), free() */
#include <cstring>
#include <iostream>

/* streaming-load-memcpy.cpp (adapted from Mesa, see below) */
void util_streaming_load_memcpy(void* __restrict__ dst,
                                void* __restrict__ src,
                                size_t len);

int main(int argc, char** argv) {
    const size_t sz = 4 * 1024 * 1024; /* 4 MB */
    assert(argc == 2);

    /* Map the PCI BAR via the sysfs resource file given as argv[1]. */
    int sysfs_fd = open(argv[1], O_RDWR | O_CLOEXEC);
    if (sysfs_fd == -1) assert_perror(errno);
    void* src = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_SHARED, sysfs_fd, 0);
    if (src == MAP_FAILED) assert_perror(errno);
    if (close(sysfs_fd) == -1) assert_perror(errno);

    /* Page-aligned destination; the NT loads and aligned stores need
     * at least 16-byte alignment. */
    char* dst = static_cast<char*>(aligned_alloc(4096, sz));
    assert(dst != nullptr);

    /* Time one 4 MB copy from the BAR into ordinary memory. */
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    util_streaming_load_memcpy(dst, src, sz);
    /* Or: memcpy(dst, src, sz); */
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    float duration_sec = std::chrono::duration<float>(end - begin).count();
    float speed = sz / duration_sec / 1024.0 / 1024.0;
    std::cout << speed << " MB/s\n";

    free(dst);
    if (munmap(src, sz) == -1) assert_perror(errno);
}
util_streaming_load_memcpy() was adapted from the Mesa project (minor changes are needed to make it compile standalone).
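For context, the inner loop of the adapted routine boils down to roughly the following (a simplified sketch with a hypothetical name, assuming 16-byte-aligned buffers and a length that is a multiple of 64 bytes; the real Mesa code also handles unaligned heads and tails):

#include <immintrin.h>
#include <cstddef>

/* Sketch only (hypothetical name): four 16-byte non-temporal loads
 * (vmovntdqa) per 64-byte chunk, followed by four ordinary aligned
 * stores. Assumes dst/src are 16-byte aligned and len % 64 == 0. */
void streaming_load_copy_sketch(void* __restrict__ dst,
                                void* __restrict__ src,
                                size_t len) {
    __m128i* out = static_cast<__m128i*>(dst);
    __m128i* in  = static_cast<__m128i*>(src);
    for (size_t i = 0; i < len; i += 64, out += 4, in += 4) {
        __m128i a = _mm_stream_load_si128(in + 0);
        __m128i b = _mm_stream_load_si128(in + 1);
        __m128i c = _mm_stream_load_si128(in + 2);
        __m128i d = _mm_stream_load_si128(in + 3);
        _mm_store_si128(out + 0, a);
        _mm_store_si128(out + 1, b);
        _mm_store_si128(out + 2, c);
        _mm_store_si128(out + 3, d);
    }
}

Each iteration issues four vmovntdqa loads and four vmovdqa stores, which matches the disassembly below.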
- Compile:
g++ -mavx -O2 iobench.cpp streaming-load-memcpy.cpp -o iobench
- Run like:
./iobench /sys/bus/pci/devices/0000\:13\:00.0/resource2_wc
The streaming load loop of util_streaming_load_memcpy() compiled to:
...
1540: c4 e2 79 2a 1e vmovntdqa (%rsi),%xmm3
1545: c4 e2 79 2a 56 10 vmovntdqa 0x10(%rsi),%xmm2
154b: 48 83 c6 40 add $0x40,%rsi
154f: 48 83 c0 40 add $0x40,%rax
1553: c4 e2 79 2a 4e e0 vmovntdqa -0x20(%rsi),%xmm1
1559: c4 e2 79 2a 46 f0 vmovntdqa -0x10(%rsi),%xmm0
155f: c5 f9 7f 58 c0 vmovdqa %xmm3,-0x40(%rax)
1564: c5 f9 7f 50 d0 vmovdqa %xmm2,-0x30(%rax)
1569: c5 f9 7f 48 e0 vmovdqa %xmm1,-0x20(%rax)
156e: c5 f9 7f 40 f0 vmovdqa %xmm0,-0x10(%rax)
1573: 48 39 d6 cmp %rdx,%rsi
1576: 75 c8 jne 1540 <_Z26util_streaming_load_memcpyPvS_m+0x40>
...
I have tried the following, but nothing seems to make any meaningful difference:
- Tweak UEFI BIOS settings
- Run a recent stable kernel (6.2.6)
- Disable speculative execution mitigations (kernel cmdline mitigations=off)
- Update Intel microcode (tests done with microcode 0xf0)
- Use 256-bit NT loads (vmovntdqa with ymm* registers); see the sketch after this list
- Check /sys/kernel/debug/x86/pat_memtype_list; the memory is listed as write-combining
- Use Dell R240 on-board Matrox G200eW3 GPU memory for reading instead of our FPGA DDR
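The 256-bit attempt was along these lines (again a sketch with a hypothetical name, assuming 32-byte alignment; note that vmovntdqa with ymm registers is an AVX2 instruction, so this needs -mavx2 rather than just -mavx):

#include <immintrin.h>
#include <cstddef>

/* Sketch only (hypothetical name): two 32-byte non-temporal loads per
 * 64-byte chunk. Assumes 32-byte alignment and len % 64 == 0. */
void streaming_load_copy_256(void* __restrict__ dst,
                             void* __restrict__ src,
                             size_t len) {
    __m256i* out = static_cast<__m256i*>(dst);
    __m256i* in  = static_cast<__m256i*>(src);
    for (size_t i = 0; i < len; i += 64, out += 2, in += 2) {
        __m256i a = _mm256_stream_load_si256(in + 0);
        __m256i b = _mm256_stream_load_si256(in + 1);
        _mm256_store_si256(out + 0, a);
        _mm256_store_si256(out + 1, b);
    }
}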
Since there have been numerous CPU vulnerabilities, including the "MMIO Stale Data Vulnerabilities", I can't help wondering: is this the result of some mitigation in either the CPU microcode or the hardware design? The microcode cannot easily be downgraded below the version loaded by the firmware.
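For what it's worth, on recent kernels the CPU's status with respect to that vulnerability class is reported under sysfs, which should at least show whether a mitigation is active on the affected Xeons:
cat /sys/devices/system/cpu/vulnerabilities/mmio_stale_data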