I wanted to measure the latency of a very small piece of code, so I call the rdtscp instruction before and after it. The problem is that the latency I measure this way turns out to be 0.

static inline __attribute__((always_inline)) uint64_t rdtscp()
{
      uint64_t cycles_high, cycles_low;
      asm volatile
        ("rdtscp\n\t"
         "mov %%rdx, %0\n\t"
         "mov %%rax, %1\n\t"
         : "=r" (cycles_high), "=r" (cycles_low) :: "%rax", "%rbx", "%rcx", "%rdx");          

      return (cycles_high << 32) + cycles_low;
}

The process is pinned to one particular core, so unsynchronized TSC registers on different CPUs can't be the problem. I know that I have not used a serializing instruction like cpuid, so the rdtscp instructions could be reordered by the out-of-order CPU. However, these should still be two different instructions, and as far as I know the TSC register is updated every clock cycle. So the values that the two instructions read cannot be the same!

The only possible reason I can think of is that the hyperthreaded CPU issues both instructions at exactly the same time. Is that correct?

  • Could you show the code? – harold Oct 19 '18 at 12:41
  • @harold Added the code. This method is called before and after the piece of code whose latency I wish to measure. – Priyank Palod Oct 19 '18 at 12:49
  • This is a pretty poor way to invoke rdtscp. Doesn't gcc have a `__builtin_ia32_rdtscp`? – David Wohlferd Oct 19 '18 at 19:05
  • Can you include a [mcve] where you use this and find 0 cycles? Is it always exactly 0, or is there some noise in the measurement and it's only sometimes 0? If it's always exactly zero, that's probably a bug. `rdtscp` is more than 4 uops, and its one-way serialization barrier effect means it shouldn't be possible for two `rdtscp` instructions to execute in the same cycle as each other. (On Skylake, Agner Fog measured it at 22 uops, with a throughput of one per 32 cycles.) And BTW, Hyperthreading can't be the issue for a single thread; what matters here is superscalar / out-of-order single-threaded execution. – Peter Cordes Oct 22 '18 at 05:23
  • You did use `asm volatile`, so the compiler can't reuse the same result. (Your `mov` instructions are unnecessary, though. Just use `"=d"` and `"=a"` constraints to tell the compiler where the results are. And `rdtscp` doesn't touch RBX. Or better, just use the builtin / intrinsic.) – Peter Cordes Oct 22 '18 at 05:29
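
The suggestions in the comments above could look roughly like the following. This is only a sketch, assuming GCC or Clang on x86-64; the function names `rdtscp_asm`, `rdtscp_intrin`, and `min_back_to_back_delta` are made up for illustration and do not come from the question. The first variant binds the outputs directly to EAX/EDX/ECX with `"=a"`, `"=d"`, and `"=c"`, so no `mov` instructions or clobber list are needed; the second uses the `__rdtscp` intrinsic from `<x86intrin.h>`.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp (GCC/Clang, x86 only) */

/* Inline-asm variant (hypothetical name). rdtscp writes the low half of
   the TSC to EAX, the high half to EDX, and IA32_TSC_AUX to ECX, so all
   three are declared as outputs instead of being copied or clobbered. */
static inline uint64_t rdtscp_asm(void)
{
    uint32_t lo, hi, aux;
    __asm__ volatile ("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    (void)aux;                      /* core/node ID, unused here */
    return ((uint64_t)hi << 32) | lo;
}

/* Intrinsic variant (hypothetical name): let the compiler emit rdtscp. */
static inline uint64_t rdtscp_intrin(void)
{
    unsigned int aux;               /* receives IA32_TSC_AUX */
    return __rdtscp(&aux);
}

/* Smallest difference seen between two back-to-back reads
   (hypothetical helper). Running this with nothing in between gives a
   rough idea of the timer overhead on a given machine. */
static uint64_t min_back_to_back_delta(int iterations)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; i++) {
        uint64_t start = rdtscp_intrin();
        uint64_t end = rdtscp_intrin();
        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

Printing the result of `min_back_to_back_delta` on the pinned core, before inserting the code under test between the two reads, would show whether the 0 comes from the timer itself or from how the result is being used.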
