I am trying to measure the latency of multiple memory accesses that are executing in parallel in an out-of-order processor.
The problem is that any attempt to measure the latency of a load serializes it with respect to other loads.
Take, for example, naively written code that measures the latency of two loads:
1. rdtscp
2. load-1
3. rdtscp
4. rdtscp
5. load-2
6. rdtscp
In my testing, the ordering property of rdtscp on Intel x86 serializes the execution of load-1 and load-2 in the code above (i.e., load-2 is issued to the memory system only after load-1 completes execution). As a result, the code does not utilize the available memory bandwidth. Ideally, I would like to keep load throughput at its maximum while measuring the latency of each load independently.
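For concreteness, here is roughly what the naive sequence looks like in C. This is only a sketch, assuming GCC or Clang on x86-64 and the `__rdtscp` intrinsic from `<x86intrin.h>`; the helper name `timed_load` is made up for illustration and is not part of any library.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp intrinsic (GCC/Clang, x86 only) */

/*
 * Hypothetical helper: time a single load by bracketing it with rdtscp.
 * The volatile qualifier keeps the compiler from eliding the load or
 * moving it outside the two timestamp reads.
 * Returns the elapsed TSC ticks; the loaded value is stored in *value.
 */
static inline uint64_t timed_load(const volatile uint64_t *addr,
                                  uint64_t *value)
{
    unsigned int aux;                  /* receives IA32_TSC_AUX, unused here */
    uint64_t start = __rdtscp(&aux);   /* step 1 (or 4): rdtscp */
    *value = *addr;                    /* step 2 (or 5): the measured load */
    uint64_t end = __rdtscp(&aux);     /* step 3 (or 6): rdtscp */
    return end - start;
}
```

Calling `timed_load` twice back to back, once per address, should reproduce the six-step sequence above, including the serialization between the two loads that I observe.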
Is there a way to measure the latency of load-1 and load-2 while allowing them to execute in parallel?
What I need is a form of rdtscp that is ordered with respect to the load whose latency is being measured, but not explicitly ordered with respect to any other instruction. Is there a way to get this behavior with either rdtscp or rdtsc?