
I want to study the latency of instruction movdir64b on a system which supports this instruction.

How can I write a simple micro-benchmark to accomplish this?

Note: MOVDIR64B reads 64 bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. [Details: https://www.felixcloutier.com/x86/movdir64b]

jhagk

1 Answer


(From comments, apparently the actual goal was to measure how soon after one movdir64b another can execute. That's throughput, not latency. This answers the question asked, about latency, assuming the source and destination are cacheable memory regions.)


The store part is NT (non-temporal, like movntps), so you shouldn't use it if latency matters. It will forcibly evict the destination cache line from cache if it was previously present, so a reload will miss all the way to DRAM.

If you care about the data being reloaded quickly (by this core), use normal cacheable stores. Or if you care about it being reloaded by another core, it's probably still faster for another core to have to ask this core to share the line (somewhat slower than an L3 cache hit) than to go all the way to DRAM.

Note that the intended use-case is for MMIO writes to PCIe devices. (A related CPU feature, ENQCMD in Sapphire Rapids (the server version of Alder Lake / Golden Cove), provides an even better way that lets you know whether the write succeeded, without running another I/O instruction to check whether the work descriptor was submitted successfully. Phoronix article)


You could verify that reload is slow with a simple loop that makes the store and reload part of a loop-carried dependency chain. Using AVX-512 (e.g. on Tiger Lake, whose Willow Cove cores have both AVX-512 and movdir64b), you could reload the full 64 bytes and store them back into the source buffer.

Or you could use movdir64b a,b / movdir64b b,a to do 64-byte copies in alternating directions. (And then take the average cycles / iteration for the loop.)

   lea  rdi, [rel buf+0]      ; buf must be 64-byte aligned (movdir64b dest requirement)
   lea  rsi, [rel buf+64]
   mov  ecx, 10000000
 .loop:
    movdir64b rdi, [rsi]      ; 64-byte copy: [rsi] -> [rdi]
    movdir64b rsi, [rdi]      ; and back the other way: [rdi] -> [rsi]
    dec  ecx
    jnz  .loop

(put this in a static executable and time it with perf stat.)
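A build-and-run harness for that could look something like this (a sketch, assuming Linux with nasm, ld, and perf installed; the file names and the /proc/cpuinfo guard are my own additions):

```shell
#!/bin/sh
# Assemble the dependency-chain loop into a static binary and time it.
command -v nasm >/dev/null 2>&1 || { echo "nasm not installed; skipping"; exit 0; }

cat > movdir_loop.asm <<'EOF'
default rel
section .bss
align 64
buf:    resb 128              ; two adjacent 64-byte lines, 64-byte aligned

section .text
global _start
_start:
    lea  rdi, [buf]
    lea  rsi, [buf+64]
    mov  ecx, 10000000
.loop:
    movdir64b rdi, [rsi]      ; 64-byte copy buf+64 -> buf
    movdir64b rsi, [rdi]      ; and back again
    dec  ecx
    jnz  .loop
    mov  eax, 60              ; exit(0) system call
    xor  edi, edi
    syscall
EOF

# Older nasm versions don't know the mnemonic; bail out gracefully if so.
nasm -felf64 movdir_loop.asm -o movdir_loop.o || { echo "nasm too old for movdir64b"; exit 0; }
ld -o movdir_loop movdir_loop.o

# Running it on a CPU without the instruction would SIGILL, so check first.
if grep -qw movdir64b /proc/cpuinfo 2>/dev/null && command -v perf >/dev/null 2>&1; then
    perf stat ./movdir_loop   # divide core cycles by 10M iterations
else
    echo "built movdir_loop, but not running it (no movdir64b CPU flag or no perf)"
fi
```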

Or you could reload the movdir64b destination and use that load result as the source address for the movdir64b, testing latency from the address input instead of the memory-data input. (Start with the first 8 bytes of the source data holding a pointer to itself.)
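That pointer-chasing variant might look like this (a sketch; here I use the same buffer as both source and destination, seeding its first 8 bytes with a pointer to itself so the reloaded value feeds the next iteration's addresses):

    lea  rdi, [rel buf]       ; buf must be 64-byte aligned
    mov  qword [rdi], rdi     ; seed: first 8 bytes hold a pointer to the buffer
    mov  ecx, 10000000
 .loop:
    movdir64b rdi, [rdi]      ; copy the line onto itself with a direct store
    mov  rdi, [rdi]           ; reload the pointer; next iteration's addresses depend on it
    dec  ecx
    jnz  .loop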

Peter Cordes
  • Thanks for your answer. I am new to writing and running assembly code, so let me step back a little. Can you elaborate on the "put this in a static executable" part? What is the easiest way to run this (or any other) assembly code on my system? – jhagk Feb 17 '21 at 06:24
  • Yes, I am looking at the use-case you pointed out (MMIO writes to accelerator devices). – jhagk Feb 17 '21 at 06:37
  • @Rajesh: [Can x86's MOV really be "free"? Why can't I reproduce this at all?](//stackoverflow.com/q/44169342) shows some complete examples of source and build / run commands, so does [RDTSCP in NASM always returns the same value (timing a single instruction)](//stackoverflow.com/q/54621381). If you increase the iteration count enough to hide more startup overhead, you can test the same loop from inside a normally-built program, or do manual timing around a call to a function that does this, or whatever if you don't care about using perf counters to time mostly this code, not startup overhead. – Peter Cordes Feb 17 '21 at 06:38
  • @Rajesh: what latency are you trying to measure, then? From what to what? Source data being ready in memory to what, a network packet being sent? – Peter Cordes Feb 17 '21 at 06:40
  • I am trying to measure how much time, on average, a movdir64b instruction takes to read the data from a source location and place it at the destination. I assume this instruction helps multiple cores read/write data to/from host/device memory through CXL; is there any gap in my understanding? – jhagk Feb 17 '21 at 07:12
  • @Rajesh: Memory accesses (and probably at least some parts of I/O) are pipelined, so the total throughput you can achieve is *not* just a function of latency. It also depends how much one write can overlap with the previous write. Sounds like you actually want to measure throughput for some kind of device write, not latency. – Peter Cordes Feb 17 '21 at 07:19
  • I see. With regard to coding in C/C++, does the programmer have any control over the use of these instructions, or is it totally up to the compiler whether to use them? If the programmer has control, how can he indicate this in the program? – jhagk Feb 17 '21 at 10:13
  • @Rajesh: I don't expect a compiler would use them without being told to, although possibly they could be useful as part of a large memcpy on CPUs like Tremont with MOVDIR but not AVX-512, as an alternative to `movntps` stores which some memcpy implementations sometimes use. But anyway, if you want to make a compiler use them, use the intrinsic `_movdir64b(void *dst, const void* src)` documented in the asm manual entry you linked in the question. https://www.felixcloutier.com/x86/movdir64b – Peter Cordes Feb 17 '21 at 10:18