I am trying to measure the latency of multiple memory accesses that are executing in parallel in an out-of-order processor.

The problem is that any attempt to measure the latency of a load serializes it with respect to other loads.

Take for example a naively written code that measures the latency of two loads:

```
rdtscp      ; timestamp before load-1
load-1
rdtscp      ; timestamp after load-1

rdtscp      ; timestamp before load-2
load-2
rdtscp      ; timestamp after load-2
```

In the above code, the ordering property of rdtscp on Intel x86 serializes the execution of load-1 and load-2 as per my testing (i.e. load-2 is issued to the memory system only after load-1 completes execution). As a result, the above code does not utilize the available memory bandwidth. Ideally, I would like to get maximum throughput for the loads while still measuring the latency of each load independently.
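
For concreteness, a minimal C sketch of this naive approach might look like the following (illustrative only: the buffer size, indices, and flush/warm-up steps are assumptions, not the original test code):

```c
/* Build: gcc -O2 naive_timing.c
 * Naive per-load timing: each load is bracketed by rdtscp. The rdtscp
 * before load-2 waits for load-1 to become globally visible, and in
 * the testing described above loads do not reorder around rdtscp
 * either, so the two cache misses never overlap.                     */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>          /* __rdtscp, _mm_clflush, _mm_mfence */

static inline uint64_t time_load(volatile uint64_t *p, uint64_t *sink)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);   /* stamp before the load           */
    *sink = *p;                     /* the load being timed            */
    uint64_t t1 = __rdtscp(&aux);   /* waits for the load (and all
                                       other earlier work) to finish   */
    return t1 - t0;
}

int main(void)
{
    enum { N = 1 << 22 };           /* 32 MiB of uint64_t              */
    static uint64_t buf[N];
    uint64_t sink;

    memset(buf, 1, sizeof buf);       /* fault the pages in            */
    _mm_clflush((void *)&buf[0]);     /* make both timed loads miss    */
    _mm_clflush((void *)&buf[N / 2]);
    _mm_mfence();

    uint64_t lat1 = time_load(&buf[0],     &sink);
    uint64_t lat2 = time_load(&buf[N / 2], &sink);

    printf("load-1: %" PRIu64 " cycles, load-2: %" PRIu64 " cycles\n",
           lat1, lat2);
    return 0;
}
```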

Is there a way to measure latency of load-1 and load-2, while allowing them to execute in parallel?

Ideally, what I need is a form of rdtscp that is ordered with respect to the load whose latency is being measured, but not ordered with respect to any other instruction. I was wondering whether there is a way to obtain this with either rdtscp or rdtsc.

gururaj
  • [The description of rdtscp](https://www.felixcloutier.com/x86/rdtscp) says that "rdtscp waits until all previous instructions have executed and all previous loads are globally visible", so it is not directly usable for my purpose. – gururaj Jan 28 '20 at 22:02
  • [A previous Stack Overflow post](https://stackoverflow.com/questions/52158572/whats-up-with-the-half-fence-behavior-of-rdtscp#comment100016635_52158572) opined that rdtscp enforces full lfence-like behaviour: load instructions on either side of an rdtscp do not get reordered around it. I have observed similar behaviour in my testing, which prompted this question. – gururaj Jan 28 '20 at 22:07

1 Answer

I don't think there's any way to sample a time with an input dependency on a specific register, or any other way to let loads complete out of order but still time each one individually, or even just let them overlap.

There are perf events mem_trans_retired.load_latency_gt_32 and so on, with power-of-2 thresholds from 4 to 512. You could program the counters and read them with rdpmc, but that wouldn't tell you which load triggered which event.

Given your overall goal, you could use those counters with perf stat or perf record to get an average for a whole loop, in the case where (single-core) memory bandwidth is maxed out.

Note that they count latency from first dispatch (to a load port), not from issue into the back-end.
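
A rough sketch of that whole-loop approach (the benchmark loop, sizes, and the perf command in the comment are illustrative assumptions, not part of the original answer; exact event names come from `perf list` and vary by microarchitecture):

```c
/* Build: gcc -O2 mlp_loop.c
 * Illustrative benchmark loop: many independent loads, with no timing
 * instructions inside the loop, so memory-level parallelism is limited
 * only by the core and the memory system. Run the whole thing under
 * perf to get latency-bucket counts for the loop as a whole (they are
 * aggregate counts, not per-load latencies), e.g.:
 *
 *   perf stat -e mem_trans_retired.load_latency_gt_32 ./a.out
 */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    enum { N = 1 << 24 };                   /* 128 MiB of uint64_t     */
    uint64_t *buf = malloc(N * sizeof *buf);
    if (!buf) return 1;
    memset(buf, 1, N * sizeof *buf);        /* fault all pages in      */

    /* The next index never depends on loaded data, so the out-of-order
     * core is free to overlap many cache misses at once.              */
    uint64_t sum = 0, idx = 12345;
    for (long i = 0; i < 100000000L; i++) {
        idx = idx * 6364136223846793005ULL + 1442695040888963407ULL;
        sum += buf[idx & (N - 1)];
    }

    printf("sum = %" PRIu64 "\n", sum);     /* keep the loads live     */
    free(buf);
    return 0;
}
```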

Peter Cordes