
x86_64 has an instruction movdir64b, which to my understanding is a non-temporal copy (well, at least the store is) of 64 bytes (a cache line). AArch64 seems to have a similar instruction st64b, which does an atomic store of the same size. However, the official ARMv9 documentation is not clear about whether st64b, too, is a non-temporal store.

Intel's instruction-set reference documentation for movdir64b is much more detailed, but I'm not far along enough in my studies to fully understand what each memory type protocol represents.

From what I could deduce so far, the x86_64 instruction movntdq is roughly equivalent to stnp, and is write-combining. From that, it seems as if movdir64b is like four of those in one atomic store, hence my guess about st64b.

This is almost certainly an oversimplification of what's really going on (and could be wrong or inaccurate, of course), but it's what I could deduce so far.

Could st64b be used as if it were an atomic sequence of four stnp instructions as a non-temporal write of a cache line in this way?

Peter Cordes
Mona the Monad
  • Just FYI, the intended use-case for `movdir64b` is reliably creating 64-byte PCIe transactions, [like `enqcmd` but not as good](https://www.phoronix.com/scan.php?page=news_item&px=Linux-Make-Use-Of-ENQCMD). You **can** use it to store to DRAM, in which case the store side of it is the same as other NT stores. But nothing guarantees 64-byte read atomicity so it's non-trivial to take advantage of. And being NT means that it hurts performance to use it on data that other cores are about to read. Possibly useful with NVDIMM persistent memory, if the write atomicity gives persistence atomicity. – Peter Cordes Jan 03 '22 at 04:29
  • (In practice, aligned AVX512 64-byte loads/stores seem to be atomic on current Intel CPUs, so non-portable high-performance code tuned for specific known microarchitectures would consider using that for communication between CPUs.) Normally on x86, PCIe MMIO and device-memory regions would already be mapped WC (i.e. NT-store semantics), like how `movdir64b` always works. Anyway, IDK anything about ARM's `st64b`, but I'd guess that it's also intended for efficient PCIe writes. – Peter Cordes Jan 03 '22 at 04:31
  • My "use case" is more so to have a way to do a "streaming" write (i.e. generate a cache line or two, NT store it, generate some more, and so on, not to be read back soon), so atomicity is not really needed there. My mistake if I worded the question differently. – Mona the Monad Jan 03 '22 at 14:20

1 Answer


The ST64B/ST64BV/ST64BV0 instructions are intended to efficiently add work items to a work queue of an I/O device that supports this interface. When the target address is mapped to an I/O device, the store is translated as a non-posted write transaction, which means that there has to be a completion message that includes a status code as described in the documentation. The ST64B instruction simply discards the status code while the other two store it in the register specified by the Xs operand.

If you look at the pseudocode, these instructions require the target address to be in uncacheable memory:

if acctype == AccType_ATOMICLS64 && memattrs.memtype == MemType_Normal then
    if memattrs.inner.attrs != MemAttr_NC || memattrs.outer.attrs != MemAttr_NC then
        fault.statuscode = Fault_Exclusive;
        return (fault, AddressDescriptor UNKNOWN);

Otherwise, the resulting status code is 0xFFFFFFFF_FFFFFFFF, which, as described in the documentation, indicates that the target address doesn't support atomic 64-byte stores. Note that this is different from status code 1, which represents a failure that can occur for a number of reasons, for example because the work queue of the target device is full.

My understanding from the pseudocode is that these instructions can be used on normal memory as well as device memory, as long as the target address is in uncacheable memory. You should check experimentally whether they really work on normal memory by examining the status code.
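A minimal sketch of such an experiment in C, assuming a compiler that implements the ACLE LS64 intrinsics (`__arm_st64bv` and `data512_t` from `<arm_acle.h>`, available when `__ARM_FEATURE_LS64` is defined). The enum and the function names `classify_st64_status` and `try_st64bv` are hypothetical and only here for illustration. Treating a status of 0 as "accepted" is an assumption; the text above only pins down the values 1 and 0xFFFFFFFF_FFFFFFFF.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_LS64)
#include <arm_acle.h>   /* data512_t, __arm_st64bv */
#endif

/* Hypothetical names, introduced for illustration only. */
enum st64_result {
    ST64_ACCEPTED,      /* assumption: status 0 means the store was accepted */
    ST64_FAILED,        /* status 1: failure, e.g. the device work queue is full */
    ST64_UNSUPPORTED,   /* all-ones: target does not support 64-byte atomic stores */
    ST64_OTHER          /* any other value: not interpreted by this sketch */
};

/* Map the 64-bit status returned by ST64BV/ST64BV0 to one of the cases above. */
static enum st64_result classify_st64_status(uint64_t status)
{
    if (status == UINT64_MAX) return ST64_UNSUPPORTED;
    if (status == 0)          return ST64_ACCEPTED;
    if (status == 1)          return ST64_FAILED;
    return ST64_OTHER;
}

#if defined(__ARM_FEATURE_LS64)
/* Issue one ST64BV to a 64-byte-aligned address and classify its status. */
static enum st64_result try_st64bv(void *dst, data512_t line)
{
    assert(((uintptr_t)dst & 63) == 0);
    return classify_st64_status(__arm_st64bv(dst, line));
}
#endif
```

On a build without LS64 only the classifier is compiled. On real hardware, `dst` would have to point into a mapping with the memory attributes discussed above, which is outside what this sketch sets up.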

These instructions are completely different from ARM's STNP and x86's MOVNTDQ. The corresponding x86 instructions are MOVDIR64B, ENQCMD, and ENQCMDS, although there are major differences between the ARM ones and the x86 ones. The "mental equivalence" you're making between these instructions is kind of OK if you mean it in terms of purpose, not behavior.
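For the plain "streaming write" use case mentioned in the comments, the ordinary non-temporal stores are the ones to reach for. The following is a rough sketch, not a tuned implementation: `stream_line64` is a hypothetical helper that writes one 64-byte line with four MOVNTDQ stores on x86-64 (via `_mm_stream_si128`) or two STNP Q-register pair stores on AArch64 (via inline assembly), and falls back to `memcpy` elsewhere. Nothing here makes the 64 bytes atomic as a unit, and on AArch64 the non-temporal part of STNP is only a hint.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: copy 64 bytes from src to a 64-byte-aligned dst,
   using non-temporal stores where the target architecture offers them. */
static void stream_line64(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 63) == 0);
#if defined(__x86_64__) || defined(_M_X64)
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (int i = 0; i < 4; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));  /* movntdq */
    _mm_sfence();  /* order the NT stores before later stores */
#elif defined(__aarch64__)
    __asm__ volatile(
        "ldp  q0, q1, [%1]\n\t"
        "ldp  q2, q3, [%1, #32]\n\t"
        "stnp q0, q1, [%0]\n\t"
        "stnp q2, q3, [%0, #32]\n\t"
        :
        : "r"(dst), "r"(src)
        : "v0", "v1", "v2", "v3", "memory");
#else
    memcpy(dst, src, 64);  /* no non-temporal store available in this sketch */
#endif
}
```

A caller would generate a line into a small scratch buffer, pass it to `stream_line64`, and move on to the next destination line.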

Hadi Brais