
I have a use case where an x86 CPU has to write 64 bytes of data to a PCIe slave device whose memory has been mmap'ed into user space. At the moment I use memcpy to do that, but it turns out to be very slow. Can we use Intel SSE intrinsics like _mm_stream_si128 to speed it up, or is there any other mechanism (other than using DMA)?

The objective is to pack all 64 bytes into one TLP and send it over the PCIe bus to reduce the overhead.
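For reference, here's a minimal sketch of the kind of copy I'm considering; the function name and the 16-byte alignment assumption are mine, not from my real code:

```c
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */

/* Hypothetical sketch: copy one 64-byte block to the mmap'ed BAR with four
 * non-temporal 16-byte stores. Assumes dst and src are 16-byte aligned. */
static void copy64_nt(void *dst, const void *src)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    _mm_stream_si128(d + 0, _mm_load_si128(s + 0));
    _mm_stream_si128(d + 1, _mm_load_si128(s + 1));
    _mm_stream_si128(d + 2, _mm_load_si128(s + 2));
    _mm_stream_si128(d + 3, _mm_load_si128(s + 3));

    _mm_sfence();  /* make the weakly-ordered NT stores globally visible */
}
```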

Anil Abraham
  • Are you certain that your `memcpy()` implementation isn't already using SSE instructions? – Jason R Aug 27 '15 at 13:36
  • memcpy is from the standard glibc, so I'm a little doubtful whether it uses SSE instructions. – Anil Abraham Aug 27 '15 at 17:11
  • How slow is slow? With or without SSE, copying the data takes on the order of a few cycles. – user3528438 Aug 27 '15 at 17:18
  • 1
    @AnilAbraham: Check by running your code in a debugger and stepping into `memcpy()`, or by disassembling your C library file. You'll find that `memcpy()` is typically quite well-optimized for your platform, as it should be, since it's so frequently used. – Jason R Aug 27 '15 at 17:45

1 Answer


As I understand it, memory-mapped I/O doesn't make certain store instructions special. An 8B store from movq [mem], xmm is the same as the store from mov [mem], r64.

I think if you have 64B to write into MMIO, you should write it with whatever instructions do so most efficiently as it's generated, then flush the cache line. Generating a 64B buffer and then doing a memcpy (or doing it yourself with four movdqa stores, or two AVX vmovdqa) is a waste of time, unless you expect the code that generates the 64B to be slow and more likely to be interrupted partway through than memcpy. A timer interrupt can come in at any time, including during your memcpy, if you're in user space where you can't disable interrupts. Since you can't guarantee a complete 64B write, a 99.99% chance of a full-cacheline write vs. a 99.99999% chance probably won't make a difference.
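Something like this sketch is what I mean by storing the data with vector instructions and then flushing the line (assuming a cacheable mapping and a 64-byte-aligned pointer; the names are placeholders, not your actual code):

```c
#include <emmintrin.h>   /* SSE2: _mm_store_si128, _mm_clflush, _mm_sfence */

/* Sketch of the "write in place, then flush" idea. Assumes mmio points at a
 * 64-byte-aligned, cacheable mapping of the device memory. */
static void write64_then_flush(void *mmio, const __m128i data[4])
{
    __m128i *d = (__m128i *)mmio;

    _mm_store_si128(d + 0, data[0]);
    _mm_store_si128(d + 1, data[1]);
    _mm_store_si128(d + 2, data[2]);
    _mm_store_si128(d + 3, data[3]);

    _mm_clflush(mmio);   /* evict the line so the stores reach the device */
    _mm_sfence();
}
```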

Streaming stores to the MMIO region might avoid the CPU doing a read-for-ownership after the clflush from the previous write. clwb isn't available yet, so the only option is clflush, which evicts the data from cache.


Non-temporal loads/stores are so-called weakly ordered. IDK if that means you'd need more fencing than usual to guarantee ordering.
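If ordering does matter (e.g. a later "doorbell" register write must not pass the data), an sfence between them would enforce it. A hypothetical sketch, with made-up register offsets:

```c
#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */
#include <stdint.h>

#define DATA_OFF 0x00    /* placeholder offsets, not a real device layout */
#define DB_OFF   0x40

static void post_and_ring(uint8_t *bar, __m128i payload)
{
    _mm_stream_si128((__m128i *)(bar + DATA_OFF), payload);
    _mm_sfence();  /* order the weakly-ordered NT store before the doorbell */
    *(volatile uint32_t *)(bar + DB_OFF) = 1;
}
```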

One use-case for streaming loads/stores is copying from uncacheable memory, like video RAM. I'm not sure about using them for MMIO. I found this article about it, talking about how to read from MMIO without just getting the same cached value.
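For the read direction, a streaming load would look something like this sketch (assumes SSE4.1 for MOVNTDQA, a 16-byte-aligned source, and a WC mapping, as in the video-RAM use case; details are illustrative):

```c
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

/* Sketch of a 64-byte streaming read from a WC-mapped region. */
static void read64_nt(void *src, __m128i out[4])
{
    __m128i *s = (__m128i *)src;

    out[0] = _mm_stream_load_si128(s + 0);
    out[1] = _mm_stream_load_si128(s + 1);
    out[2] = _mm_stream_load_si128(s + 2);
    out[3] = _mm_stream_load_si128(s + 3);
}
```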

Peter Cordes
  • update: yes you do want to make sure you use NT stores for write-combining, if the MMIO region is WC instead of just uncacheable. Otherwise for UC memory, the wider vectors you use the better. (It might or might not be worth saving/restoring FPU state so you can use SSE/AVX in a kernel driver.) – Peter Cordes Feb 26 '19 at 20:38