
I am writing a program that writes to a range of a device's HW registers. I am using mmap to map the HW addresses into the virtual address space (user space). I checked the result of the mmap and it is OK. I implemented a function that copies a buffer into the device:

void bufferCopy(void *dest, void *src, const size_t size) {
    uint8_t *pdest = static_cast<uint8_t *>(dest);
    uint8_t *psrc = static_cast<uint8_t *>(src);
    size_t iters = 0, tailBytes = 0;

    /* iterate 64bit */
    iters = (size / sizeof(uint64_t));
    for (size_t index = 0; index < iters; ++index) {
        *(reinterpret_cast<uint64_t *>(pdest)) =
            *(reinterpret_cast<uint64_t *>(psrc));
        pdest += sizeof(uint64_t);
        psrc += sizeof(uint64_t);
    }
.
.
.

But when running it on QEMU I get an illegal instruction exception. When I debugged it, I found that it crashes on the following instruction (below is the asm of the main loop):

movdqu (%rsi,%rax,1),%xmm0                                                   
movups %xmm0,(%rdi,%rax,1)   <----- this instruction crashes ...                                                  
add    $0x10,%rax                                                            
cmp    %rax,%r9                                                              
jne    0x7ffff7eca1e0 <_ZN12_GLOBAL__N_110bufferCopyEPvS0_m+64>   

Any ideas why? My guess is that you can write to PCI only in 32/64-bit accesses. The compiler doesn't know about my limitations, so it optimizes my code and creates a vectorized loop (each iteration loads 128 bits and stores 128 bits). Does that make sense? Can I write to PCI with vectorized instructions?

Also, whether it is a missing feature in QEMU, a bug, or just a recommendation, how can I prevent the compiler from generating those vector instructions?

yehudahs
  • #UD can't be caused by the CPU because of anything related to the destination. Perhaps QEMU is emulating the device and this is a limitation in QEMU. – prl Jul 07 '21 at 17:48
  • @prl: I think we've had a previous Q&A about QEMU not supporting vector stores to uncacheable or WC memory regions. Real x86 can, that's an emulator missing feature. – Peter Cordes Jul 07 '21 at 18:27
  • Unless you specifically know that it makes sense for the device you're accessing to let the compiler generate any random stuff for the memory accesses, it's almost certainly better to exercise more careful control over what instructions get used for device register accesses, though. For instance you usually want to avoid registers being written multiple times, reads and writes being reordered relative to each other, or word-sized accesses being broken up into byte accesses. The Linux kernel, for instance, uses specific accessor functions that it can guarantee turn into plain loads/stores. – Peter Maydell Jul 07 '21 at 18:59
  • How can I prevent the compiler from generating those vector instructions? – yehudahs Jul 08 '21 at 06:26
  • Using `volatile` would be one obvious way: each C++ access becomes one asm access. MMIO is the main use-case `volatile` is ideal for. For actual MMIO *registers*, that's what you want. For writing to device *memory*, on real hardware you'd rather use wider stores. – Peter Cordes Jul 08 '21 at 06:34
  • @PeterCordes - the thing is, it works on some hosts (that run QEMU) and fails on others. All of them have SSE instructions (I see sse when running cat /proc/cpuinfo). – yehudahs Jul 08 '21 at 08:02
  • Are you compiling different binaries on those different guest machines? Maybe some auto-vectorize, some don't? According to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202643#c40, QEMU+KVM supports SSE, and advertises that fact to the guest via CPUID, but *doesn't* actually support it properly when the memory operand is uncacheable. The workaround that Xorg used was to disable SSE instructions for xf86SlowBcopy. – Peter Cordes Jul 08 '21 at 15:24
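
To illustrate the `volatile` suggestion from the comments, here is a minimal sketch of how the copy loop might look when the destination is accessed through a `volatile` pointer. The function name `mmioCopy` is hypothetical and not from the original post. The sketch assumes the destination is 8-byte aligned and that the device tolerates byte-wide writes for any leftover tail bytes; both assumptions need checking against the actual hardware. Because each `volatile` store must be emitted as its own access, the compiler should not merge two 64-bit stores into one 128-bit SSE store.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original post): copies `size` bytes to a
// memory-mapped destination using only 64-bit stores plus byte stores for the tail.
// Assumption: `dest` is 8-byte aligned and the device accepts byte-wide writes.
void mmioCopy(volatile void *dest, const void *src, std::size_t size) {
    volatile std::uint64_t *pdest64 = static_cast<volatile std::uint64_t *>(dest);
    const unsigned char *psrc = static_cast<const unsigned char *>(src);

    /* iterate 64bit: one volatile store per iteration, so no auto-vectorization */
    const std::size_t iters = size / sizeof(std::uint64_t);
    for (std::size_t index = 0; index < iters; ++index) {
        std::uint64_t word;
        std::memcpy(&word, psrc, sizeof(word)); // safe load from an unaligned source
        pdest64[index] = word;
        psrc += sizeof(word);
    }

    /* remaining tail bytes, one byte-wide volatile store each */
    volatile std::uint8_t *pdest8 =
        reinterpret_cast<volatile std::uint8_t *>(pdest64 + iters);
    const std::size_t tailBytes = size % sizeof(std::uint64_t);
    for (std::size_t index = 0; index < tailBytes; ++index) {
        pdest8[index] = psrc[index];
    }
}
```

A call such as `mmioCopy(mappedRegs, buffer, bufferSize)` would replace the original `bufferCopy` call, where `mappedRegs` stands for the pointer returned by `mmap`. This follows the advice given for MMIO *registers*; for device *memory* on real hardware, wider stores may be preferable, as noted in the comments.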

0 Answers