(From comments, apparently the actual goal was to measure how soon after one movdir64b another one can execute. That's throughput, not latency. This answers the question as asked, about latency, assuming source and destination are cacheable memory regions.)
The store part is NT (non-temporal, like movntps), so you shouldn't use it if reload latency matters. It will forcibly evict the destination cache line from cache if it was previously present, so a reload will cause a cache miss all the way to DRAM.
If you care about the data being reloaded quickly (by this core), use normal cacheable stores. Or if you care about it being reloaded by another core, it's probably still faster for another core to have to ask this core to share the line (somewhat slower than an L3 cache hit) than to go all the way to DRAM.
Note that the intended use-case is MMIO writes to PCIe devices. (Another CPU feature, ENQCMD in Sapphire Rapids (server version of Alder Lake / Golden Cove), provides an even better way: it lets you know whether the work descriptor was submitted successfully without running another I/O instruction to check. See the Phoronix article.)
You could verify that reload is slow with a simple loop that makes the store and reload part of a loop-carried dependency chain. Using AVX-512 (e.g. on Tiger Lake (Willow Cove core uarch), which has both AVX-512 and movdir64b), you could reload the full 64 bytes from the destination and store them back into the source buffer, so the next movdir64b has to wait for that store.
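A minimal sketch of that AVX-512 reload loop, as a standalone NASM program for x86-64 Linux. The labels src and dst, the iteration count, and the exit sequence are my own choices for illustration, and I haven't tuned or measured this; it only shows the shape of the dependency chain (movdir64b store, 64-byte reload, store back to the source).

```nasm
; Sketch only: needs a CPU with both AVX-512F and MOVDIR64B.
default rel
global _start

section .bss align=64
alignb 64
src:    resb 64                 ; hypothetical source buffer
dst:    resb 64                 ; hypothetical destination (must be 64-byte aligned)

section .text
_start:
    lea     rsi, [src]
    lea     rdi, [dst]
    mov     ecx, 10000000       ; iteration count, arbitrary
.loop:
    movdir64b rdi, [rsi]        ; load 64 bytes from src, NT-store them to dst
    vmovdqa64 zmm0, [rdi]       ; reload the line that was just written
    vmovdqa64 [rsi], zmm0       ; store it back: the next movdir64b loads this
    dec     ecx
    jnz     .loop

    mov     eax, 231            ; exit_group(0)
    xor     edi, edi
    syscall
```

Something like nasm -felf64 chain.asm && ld -o chain chain.o should build it (the file name is arbitrary). Divide the core cycle count from perf stat by the iteration count to get cycles per round trip, which includes the vector store and its forwarding to the next movdir64b load as well as the reload itself.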
Or you could use movdir64b a,b / movdir64b b,a to do 64-byte copies in alternating directions, and then take the average cycles per iteration for the loop. (Each iteration contains two store/reload round trips.)
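    ; buf: 128 bytes, 64-byte aligned (movdir64b faults on a misaligned destination)
    lea     rdi, [rel buf+0]
    lea     rsi, [rel buf+64]
    mov     ecx, 10000000
.loop:
    movdir64b rdi, [rsi]        ; copy 64 bytes from [rsi] to [rdi]
    movdir64b rsi, [rdi]        ; copy back: has to reload what the previous one stored
    dec     ecx
    jnz     .loop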
(Put this in a static executable and time it with perf stat.)
Or you could reload the movdir64b destination and use that load result as the source address for the next movdir64b, testing latency from the address input instead of the memory-data input. (Start with the first 8 bytes of the source data holding a pointer to itself.)
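A rough sketch of that pointer-chasing variant, again as a standalone NASM program for x86-64 Linux. The buffer names, iteration count, and exit sequence are assumptions for illustration. The first qword of the source is set to the source's own address, so every reload from the destination yields the same pointer but still has to wait for the store.

```nasm
; Sketch only: needs a CPU with MOVDIR64B.
default rel
global _start

section .bss align=64
alignb 64
src:    resb 64                 ; hypothetical source buffer
dst:    resb 64                 ; hypothetical destination (must be 64-byte aligned)

section .text
_start:
    lea     rsi, [src]
    lea     rdi, [dst]
    mov     [rsi], rsi          ; first 8 bytes of the source point at the source
    mov     ecx, 10000000       ; iteration count, arbitrary
.loop:
    movdir64b rdi, [rsi]        ; copy src to dst; source address comes from rsi
    mov     rsi, [rdi]          ; reload the pointer from dst: next source address
    dec     ecx
    jnz     .loop

    mov     eax, 231            ; exit_group(0)
    xor     edi, edi
    syscall
```

Here the loop-carried dependency goes through rsi, so cycles per iteration approximates the latency from the movdir64b address input to the reloaded data being available.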