As I understand it, memory-mapped I/O doesn't make certain store instructions special. An 8B store from movq mem, xmm is the same as the store from mov mem, r64.
I think if you have 64B to write into MMIO, you should do it with whatever instructions write the data most efficiently as it's generated, then flush the cache line. Generating a 64B buffer and then doing memcpy (or doing it yourself with four movdqa, or two AVX vmovdqa) is a waste of time, unless you expect your code that generates the 64B to be slow and more likely to be interrupted part way through than memcpy. A timer interrupt can come in at any time, including during your memcpy, if you're in user space where you can't disable interrupts. Since you can't guarantee complete 64B writes, a 99.99% chance of a full cache-line write vs. a 99.99999% chance probably won't make a difference.
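To make the "write it as it's generated, then flush" idea concrete, here's a rough sketch. It's x86-only and assumes SSE2 intrinsics. The names fake_mmio and write_line_in_place are made up for illustration, the "generation" step is a placeholder, and an ordinary aligned array stands in for the mapped region, so this only shows the shape of the code, not real device behavior.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>

#define LINE_BYTES 64

/* Hypothetical stand-in for one cache line of a mapped device region.
   Real code would get this pointer from mmap() or from the driver. */
static _Alignas(LINE_BYTES) uint64_t fake_mmio[LINE_BYTES / sizeof(uint64_t)];

/* Store eight 64-bit words straight into the destination line as they
   are produced, then flush that line. No staging buffer, no memcpy. */
static void write_line_in_place(volatile uint64_t *dst, uint64_t seed)
{
    /* The 64 bytes only fit in one cache line if dst is 64B-aligned. */
    assert(((uintptr_t)dst % LINE_BYTES) == 0);

    for (int i = 0; i < 8; i++)
        dst[i] = seed + (uint64_t)i;   /* placeholder for the real generation step */

    _mm_clflush((const void *)dst);    /* write the line back and evict it */
    _mm_mfence();                      /* order the flush against later accesses */
}
```

Calling write_line_in_place(fake_mmio, 100) leaves the words 100 through 107 in the line. Nothing here makes the 64B write atomic; an interrupt can still land between any two of the stores.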
Streaming stores to the MMIO region might avoid the CPU doing a read-for-ownership after the clflush from the previous write. clwb isn't available yet, so the only option is clflush, which evicts the data from cache.
Non-temporal loads/stores are so-called weakly-ordered. IDK if that means you'd need more fencing to guarantee ordering.
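For reference, here's a sketch of what the streaming-store version could look like, with an sfence afterwards, which is the usual way to order non-temporal stores before any later stores. Again x86-only with SSE2 intrinsics. fake_wc_window and stream_line are hypothetical names, and a plain aligned array stands in for the mapped region. Whether the four stores actually get combined into a single 64B write depends on the memory type and the CPU, so don't read this as a guarantee.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si128, _mm_set_epi64x, _mm_sfence */
#include <stdint.h>

#define LINE_BYTES 64

/* Hypothetical stand-in for one cache line of a mapped device region. */
static _Alignas(LINE_BYTES) uint64_t fake_wc_window[LINE_BYTES / sizeof(uint64_t)];

/* Write 64 bytes with four 16-byte non-temporal (movntdq) stores,
   then fence so they are ordered before any later stores. */
static void stream_line(void *dst, const uint64_t src[8])
{
    /* movntdq needs 16B alignment; 64B alignment keeps it to one line. */
    assert(((uintptr_t)dst % LINE_BYTES) == 0);

    __m128i *out = (__m128i *)dst;
    for (int i = 0; i < 4; i++) {
        /* _mm_set_epi64x takes (high, low), so the lower word goes second. */
        __m128i v = _mm_set_epi64x((long long)src[2 * i + 1],
                                   (long long)src[2 * i]);
        _mm_stream_si128(out + i, v);  /* non-temporal 16-byte store */
    }

    _mm_sfence();  /* NT stores are weakly ordered; fence before later stores */
}
```

With a source array of {1, 2, ..., 8}, stream_line(fake_wc_window, src) leaves the same eight words in the destination. The sfence only orders the stores; it doesn't make the four of them atomic with respect to interrupts.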
One use-case for streaming loads/stores is copying from uncacheable memory, like video RAM. I'm not sure about using them for MMIO. I found this article about it, talking about how to read from MMIO without just getting the same cached value.