I want to measure the latency of a very small piece of code, so I placed an rdtscp instruction before and after it. The problem is that the latency I measure this way turns out to be 0. This is the function I use to read the counter:
static inline __attribute__((always_inline)) uint64_t rdtscp()
{
    uint64_t cycles_high, cycles_low;
    asm volatile
        ("rdtscp\n\t"
         "mov %%rdx, %0\n\t"
         "mov %%rax, %1\n\t"
         : "=r" (cycles_high), "=r" (cycles_low) :: "%rax", "%rbx", "%rcx", "%rdx");
    return (cycles_high << 32) + cycles_low;
}
The process is pinned to one core, so unsynchronized TSC registers on different CPUs can't be the problem. I know that I have not used a serializing instruction such as cpuid, so the out-of-order CPU could reorder the rdtscp instructions. However, they are still two separate instructions, and as far as I know the TSC register is updated every clock cycle, so the two instructions should never read the same value!
The only explanation I can think of is that the hyperthreaded CPU issues both instructions at exactly the same time. Is that correct?
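For comparison, here is a minimal sketch of what a fenced measurement might look like. It is not the code from the measurement above. It assumes GCC or Clang on x86-64 and uses the `__rdtscp` and `_mm_lfence` intrinsics from `<x86intrin.h>` instead of inline asm. The names `read_tsc` and `time_loop`, and the loop being timed, are hypothetical stand-ins.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_lfence (GCC/Clang, x86-64 assumed) */

/* Read the TSC. rdtscp waits for earlier instructions to execute; the
   lfence after it keeps later instructions from starting before the read. */
static inline uint64_t read_tsc(void)
{
    unsigned int aux;              /* receives IA32_TSC_AUX; unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Time a hypothetical stand-in for the code under test and return the
   elapsed TSC ticks. */
uint64_t time_loop(int iterations)
{
    volatile uint64_t sink = 0;    /* volatile so the loop is not optimized away */

    uint64_t start = read_tsc();
    for (int i = 0; i < iterations; i++)
        sink += (uint64_t)i;
    uint64_t end = read_tsc();

    return end - start;
}
```

Calling `time_loop(1000)` returns the tick difference for the loop. Timing an empty region the same way would give a rough idea of the overhead of the two reads themselves.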