I wrote some basic code to find out the number of clock cycles taken by nop. We know nop takes one clock cycle.

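#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>   /* PRIu64: portable printf format for uint64_t */

int main(void)
{
    uint32_t low1, low2, high1, high2;
    uint64_t timestamp1, timestamp2;

    /* rdtsc returns the 64-bit time-stamp counter split across EDX:EAX */
    asm volatile ("rdtsc" : "=a"(low1), "=d"(high1));
    asm("nop");   /* the instruction being timed */
    asm volatile ("rdtsc" : "=a"(low2), "=d"(high2));

    /* Combine the two 32-bit halves into one 64-bit timestamp */
    timestamp1 = ((uint64_t)high1 << 32) | low1;
    timestamp2 = ((uint64_t)high2 << 32) | low2;
    printf("Diff:%" PRIu64 "\n", timestamp2 - timestamp1);
    return 0;
}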

But the output is not 1.

It is sometimes 14 or 16.

May I know the reason behind this? Am I missing anything?

Peter Cordes
md.jamal
  • For one thing, your timing loop also includes the time to execute 1 `rdtsc` instruction. – 1201ProgramAlarm Jan 12 '20 at 01:30
  • What compiler options are you using? – Sedat Kapanoglu Jan 12 '20 at 01:34
  • Normal: gcc userprog.c -o userprog – md.jamal Jan 12 '20 at 01:35
  • The `constant_tsc` flag in `/proc/cpuinfo` would indicate that you're measuring time, not cycles. You probably want to send a serialisation instruction to prevent out-of-order execution. Have you set CPU affinity to a single core? – LegendofPedro Jan 12 '20 at 01:47
  • How do you know `nop` takes one cycle? Some processors can remove several from the instruction stream each cycle, so they are never dispatched and consume no execution time. – Eric Postpischil Jan 12 '20 at 01:49
  • You probably want to compile with `-O0` to disable optimisation (and maybe `-S` to verify the assembly output). – LegendofPedro Jan 12 '20 at 01:50
  • @LegendofPedro: no, `-O0` would just put more garbage in the timed interval, but still wouldn't make the 2nd RDTSC wait for completion of earlier instructions. Or stop the first RDTSC from running early as well. See my canonical answer about RDTSC: [How to get the CPU cycle count in x86\_64 from C++?](//stackoverflow.com/a/51907627) – Peter Cordes Jan 12 '20 at 02:07
  • *We know nop takes one clock cycle.* **[What kind of chip you got in there, a Dorito?](https://www.youtube.com/watch?v=qpMvS1Q1sos)** Seriously though, what CPU did you test this on, just so the answer can include the details in an explanation of base / reference frequency (TSC) vs actual core clock cycles, assuming `constant_tsc`? Surely not a 486 or earlier where NOP would actually cost 1 cycle. – Peter Cordes Jan 12 '20 at 02:16
  • @PeterCordes that's interesting, I would expect optimisation to not add anything (or do much) to inline asm, apart from maybe removing the `nop`. – LegendofPedro Jan 12 '20 at 02:18
  • @LegendofPedro: Right, exactly, you want optimized asm. And no, GCC/clang don't "understand" the asm template, they only scan it for `"%number"` operand substitutions before feeding the result (including the compiler-generated asm) to the assembler. With `-O0` you'd get stores to stack space for `low1` and `low2`, instead of just `mov` to other registers (or maybe `shl`/`lea` into another register before the 2nd rdtsc). On 2nd look, you wouldn't actually get more instructions in the (attempt at a) "timed region" from `-O0` because there's nothing to spill/reload; no inputs for 2nd asm – Peter Cordes Jan 12 '20 at 02:24

1 Answer

> We know nop takes one clock cycle.

A modern CPU can be thought of as a pipeline of stages. The front end fetches and decodes multiple instructions in parallel and puts the resulting micro-ops into a buffer, where they wait for their dependencies to be satisfied before being taken by an execution unit. There are multiple execution units, so multiple micro-ops can be executed at the same time.

A NOP has no micro-ops - it's simply discarded by the front end. It doesn't cost 1 cycle.
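
One way to check what a nop actually costs is to time a long run of them, so that the cost of reading the counter is spread over many instructions. The sketch below is only illustrative: `read_tsc` and `ticks_per_nop` are hypothetical helper names, it assumes GCC or clang on x86-64, and the result is in TSC ticks, which are not necessarily core clock cycles.

```c
#include <stdint.h>

#define NOPS_PER_BLOCK 1000   /* must match the .rept count in the asm below */

/* Plain rdtsc with no fences; any reordering around it is small compared
   with the long run of nops being timed. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: average TSC ticks per nop over
   blocks * NOPS_PER_BLOCK nops, keeping the best (smallest) of `trials`
   runs to filter out interrupts and other noise. The figure also includes
   the loop counter and branch, which are minor next to 1000 nops. */
double ticks_per_nop(int blocks, int trials)
{
    double best = -1.0;
    for (int t = 0; t < trials; t++) {
        uint64_t start = read_tsc();
        for (int i = 0; i < blocks; i++)
            __asm__ volatile (".rept 1000\n\tnop\n\t.endr");
        uint64_t stop = read_tsc();
        double per_nop = (double)(stop - start) /
                         ((double)blocks * NOPS_PER_BLOCK);
        if (best < 0.0 || per_nop < best)
            best = per_nop;
    }
    return best;
}
```

On a CPU that issues several instructions per cycle, a call such as `ticks_per_nop(1000, 10)` would be expected to come out well below 1, though the exact figure depends on the machine and on how the TSC frequency relates to the core clock.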

> But the output is not 1.

It probably takes 14 or 16 cycles for the instructions the compiler generates to deal with the outputs of the first rdtsc, then prepare things for the second rdtsc, then the second rdtsc itself.
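
To see how much of the 14 or 16 is measurement overhead, one option is to time an empty interval and compare it with an interval holding a single nop. This is a minimal sketch, assuming GCC or clang on x86-64 and assuming `lfence` orders `rdtsc` on the CPU in question; `fenced_rdtsc`, `min_empty_interval` and `min_nop_interval` are names made up for this example.

```c
#include <stdint.h>

/* Read the TSC with lfence on both sides, so that (under the assumption
   stated above) earlier instructions have finished before the read and
   later ones do not start until it is done. */
static inline uint64_t fenced_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("lfence\n\trdtsc\n\tlfence"
                      : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Smallest tick delta seen over `trials` runs of an empty timed interval:
   this is the cost of the measurement itself. */
uint64_t min_empty_interval(int trials)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t start = fenced_rdtsc();
        uint64_t stop = fenced_rdtsc();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}

/* Same measurement with a single nop inside the timed interval. */
uint64_t min_nop_interval(int trials)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t start = fenced_rdtsc();
        __asm__ volatile ("nop");
        uint64_t stop = fenced_rdtsc();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

If the two minimums are close, as one would expect, nearly all of the measured difference comes from the timing instructions themselves rather than from the nop.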

Note that rdtsc probably counts the cycles of a fixed-frequency timer that has nothing to do with the CPU's current (variable) clock frequency, so 14 or 16 "time cycles" might be (e.g.) 7 or 8 CPU cycles.
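
To relate TSC ticks to wall-clock time, the tick rate can be estimated by counting ticks across a sleep timed with `clock_gettime`. This is a rough sketch, assuming Linux or another POSIX system, GCC or clang on x86-64, and a TSC that keeps counting at a constant rate (the `constant_tsc` case); `read_tsc` and `estimate_tsc_hz` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Plain rdtsc; fences are unnecessary when the interval is milliseconds long. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: estimate TSC ticks per second by counting ticks
   across a sleep of roughly sleep_ms milliseconds (sleep_ms > 0), using
   CLOCK_MONOTONIC to measure how long the sleep really took. */
double estimate_tsc_hz(long sleep_ms)
{
    struct timespec req = {
        .tv_sec  = sleep_ms / 1000,
        .tv_nsec = (sleep_ms % 1000) * 1000000L
    };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = read_tsc();
    nanosleep(&req, NULL);   /* may return early; elapsed time is measured anyway */
    uint64_t c1 = read_tsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return 0.0;          /* no measurable interval */
    return (double)(c1 - c0) / secs;
}
```

Comparing the estimate with the core frequency at the time of the test gives a rough idea of how many core cycles correspond to one TSC tick.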

Brendan
  • 14 cycles is actually low for back-to-back `rdtsc` (with only a `mov ecx, eax` in between). Note that `rdtsc` does *not* wait for previous instructions to have finished executing before it executes, and it has no inputs so its microcode can start executing as soon as there's a free execution unit. – Peter Cordes Jan 12 '20 at 02:03
  • Skylake RDTSC throughput is one per 24 core clock cycles (https://agner.org/optimize), and Ryzen is 36 cycles. So the OP's CPU is presumably turboing significantly *above* the "reference" frequency of the TSC. Unless it's a K8 at idle, although K8 probably doesn't have `constant_tsc`. Anyway, see also [How to get the CPU cycle count in x86\_64 from C++?](//stackoverflow.com/a/51907627) for lots more details about `rdtsc` – Peter Cordes Jan 12 '20 at 02:10
  • Re: cost of a `nop`: it costs nothing if you're not bottlenecked on the front end, otherwise it could increase the total cost of decode + issue of a group of instructions by 1/4 or 1/5 of a cycle, or more if it causes different alignment in a problematic way. It's not actually discarded by the front-end, though; it takes a space in the ROB (1 fused-domain uop), but doesn't need an execution unit (0 unfused). You can think of it like the front-end inserting it into the back-end in an "already executed" state, like eliminated `mov` and (on Sandybridge-family) xor-zeroing. – Peter Cordes Jan 12 '20 at 02:14
  • Executing NOPs is not an important enough performance issue to be worth special casing it earlier in the front-end to save front-end issue bandwidth. I assume it would complicate a bunch of corner cases to actually do that, even if we accept that performance counters would no longer count it. – Peter Cordes Jan 12 '20 at 02:20
  • `constant_tsc` is present in `/proc/cpuinfo`. I am running this on VMware. – md.jamal Jan 12 '20 at 02:34