  1. As you know, starting with version 2.0, PCI Express supports compound atomic operations (FetchAdd, Swap, CAS): https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf

  2. It is also known that x86_64 CPUs have compound atomic instructions (lock add, [lock] xchg, lock cmpxchg): https://godbolt.org/g/MmqMRw

These can be generated by a C compiler from operations on a volatile atomic_int:

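#include <stdatomic.h>

int expected_cas = 0;
volatile atomic_int a;

void atomic_ops(void)
{
    atomic_fetch_add( &a, 1 );                             /* lock add */
    atomic_exchange( &a, 1 );                              /* [lock] xchg */
    atomic_compare_exchange_weak( &a, &expected_cas, 1 );  /* lock cmpxchg */
}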
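
To make the correspondence with FetchAdd, Swap and CAS easier to see, here is a minimal sketch of the same three operations on ordinary host memory, recording what each one returns. The struct and function names are hypothetical, and the strong compare-exchange is used instead of the weak one only so that the result is deterministic. When the return value of atomic_fetch_add is actually used, as it is here, GCC and Clang typically emit lock xadd rather than lock add.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical result record: what each operation observed. */
struct atomic_demo_result {
    int  fetch_add_old;  /* value in memory before the add   */
    int  exchange_old;   /* value in memory before the swap  */
    bool cas_succeeded;  /* true if the compare value matched */
    int  final_value;    /* value left in memory at the end  */
};

/* Runs fetch-add, exchange and compare-exchange on a local atomic_int.
   This touches ordinary cacheable memory only, not device memory. */
struct atomic_demo_result run_atomic_demo(int initial)
{
    atomic_int a;
    struct atomic_demo_result r;
    int expected = 5;

    atomic_init(&a, initial);
    r.fetch_add_old = atomic_fetch_add(&a, 1);  /* a becomes initial + 1 */
    r.exchange_old  = atomic_exchange(&a, 5);   /* a becomes 5 */
    r.cas_succeeded = atomic_compare_exchange_strong(&a, &expected, 9);
    r.final_value   = atomic_load(&a);

    /* A successful CAS must have stored the new value. */
    assert(!r.cas_succeeded || r.final_value == 9);
    return r;
}
```

All three operations hand back the previous contents of the location, which is also what a PCIe AtomicOp completion carries.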

I want to access buffer memory on a device (Ethernet adapter, GPU, ...) connected to an x86_64 PC over PCI Express, using compound atomic operations. That is, we already know how the hardware bus works (PCIe supports the FetchAdd/Swap/CAS atomics), but we want to know what assembly code is required to use these PCIe features.

Can we use the x86_64 compound atomic instructions (lock add, [lock] xchg, lock cmpxchg) to generate the corresponding compound atomic operations (FetchAdd, Swap, CAS) on PCI Express?

Or, what assembly code should we use on an x86_64 CPU to perform the FetchAdd, Swap and CAS atomic operations on PCI Express 2.0/3.0?

Alex

1 Answer


From what I can gather from the Internet, the latest generations of Intel CPUs at the time of writing [1] [2] [3] support PCIe AtomicOps only as completers.

The PCIe devices integrated into the uncore can complete an AtomicOp but cannot request one; the PCIe ports can request an AtomicOp, but that is possibly just for forwarding device-initiated requests.
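
The completer/requester distinction is visible in configuration space. As a rough sketch, the function below decodes the AtomicOp bits of the PCIe Device Capabilities 2 and Device Control 2 registers. The masks are the ones I believe the specification defines (they match the PCI_EXP_DEVCAP2_ATOMIC_* and PCI_EXP_DEVCTL2_ATOMIC_REQ values in Linux's pci_regs.h); the struct and function names are hypothetical, and how the register values are read is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AtomicOp bits in Device Capabilities 2. */
#define DEVCAP2_ATOMIC_ROUTE   0x00000040u  /* AtomicOp routing supported   */
#define DEVCAP2_ATOMIC_COMP32  0x00000080u  /* 32-bit AtomicOp completer    */
#define DEVCAP2_ATOMIC_COMP64  0x00000100u  /* 64-bit AtomicOp completer    */
#define DEVCAP2_ATOMIC_COMP128 0x00000200u  /* 128-bit CAS completer        */
/* AtomicOp bit in Device Control 2. */
#define DEVCTL2_ATOMIC_REQ     0x0040u      /* AtomicOp requester enable    */

static_assert(DEVCAP2_ATOMIC_COMP32 == (1u << 7), "32-bit completer is bit 7");

/* Hypothetical summary of what a function reports about AtomicOps. */
struct atomic_op_caps {
    bool routing;
    bool completer32;
    bool completer64;
    bool completer128_cas;
    bool requester_enabled;
};

/* Decodes raw register values into the summary above. */
struct atomic_op_caps decode_atomic_op_caps(uint32_t devcap2, uint16_t devctl2)
{
    struct atomic_op_caps caps;
    caps.routing           = (devcap2 & DEVCAP2_ATOMIC_ROUTE)   != 0;
    caps.completer32       = (devcap2 & DEVCAP2_ATOMIC_COMP32)  != 0;
    caps.completer64       = (devcap2 & DEVCAP2_ATOMIC_COMP64)  != 0;
    caps.completer128_cas  = (devcap2 & DEVCAP2_ATOMIC_COMP128) != 0;
    caps.requester_enabled = (devctl2 & DEVCTL2_ATOMIC_REQ)     != 0;
    return caps;
}
```

A device that is only a completer would report some of the completer bits with the requester enable bit clear.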

It seems that the PCIe root complex is unable to request AtomicOps.
Enabling AtomicOps would require tight coupling between the processor and the root complex: the processor would have to transmit not only the type of operation it is performing (thereby implementing a mapping between x86 instructions and PCIe AtomicOps) but also its operands.
Furthermore, the root complex would have to identify when a write targets an AtomicOps-enabled device among all the possible destinations, which requires a set of software-configurable address ranges.
Finally, AtomicOps would need special handling by the QPI Quiesce Master: since the target device already takes care of atomicity, a global QPI lock can be avoided.
All of this assumes, of course, that the target memory is not cacheable (otherwise a cache lock would take place instead).
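
To illustrate why both the operation type and its operands would have to travel from the processor to the root complex, here is a small software model of what an AtomicOp request carries and what the completer does with it. This is only a sketch with hypothetical names: it is plain single-threaded C, not a description of how any hardware implements it, and it models 64-bit operands only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three PCIe AtomicOp kinds. */
enum atomic_op_kind { OP_FETCH_ADD, OP_SWAP, OP_CAS };

/* Hypothetical request record: the information a requester must send. */
struct atomic_op_request {
    enum atomic_op_kind kind;
    uint64_t operand;  /* addend (FetchAdd), new value (Swap), compare value (CAS) */
    uint64_t swap;     /* value to store if the compare matches (CAS only) */
};

/* Models the completer: performs the read-modify-write on the target
   location and returns the original value, as the completion would. */
uint64_t complete_atomic_op(uint64_t *target, const struct atomic_op_request *req)
{
    assert(target != NULL && req != NULL);
    uint64_t original = *target;

    switch (req->kind) {
    case OP_FETCH_ADD:
        *target = original + req->operand;
        break;
    case OP_SWAP:
        *target = req->operand;
        break;
    case OP_CAS:
        if (original == req->operand)
            *target = req->swap;
        break;
    }
    return original;
}
```

Note that CAS needs two operands in the request, which is part of what a plain memory write from the processor does not convey.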

I don't think these are insurmountable obstacles; rather, I believe that AtomicOps were invented primarily to shorten the latency of an IO->HostMem atomic write or an IO->IO write.
Looking at what Intel wrote:

Today, message-based transactions are used for PCIe devices, and these use interrupts that can experience long latency, unlike CPU updates to main memory that use atomic transactions.

it seems that the primary concern is the use of an interrupt to notify a device driver that an atomic write must be performed on behalf of its managed device.

Host->IO AtomicOps are allowed, but it seems they can't be generated as of today, and surely not with a lock prefix alone.
I also believe that issuing an AtomicOp to a device from the processor would only be useful for performing a write that is atomic with respect to other PCIe devices, as processors usually synchronise among themselves with locks.

Margaret Bloom