Is Intel's timestamp reading asm code example using two more registers than are necessary?

Question

I'm looking into measuring benchmark performance using the time-stamp register (TSR) found in x86 CPUs. It's a useful register, since it measures in a monotonic unit of time which is immune to the clock speed changing. Very cool.

Here is an Intel document showing asm snippets for reliably benchmarking using the TSR, including using cpuid for pipeline synchronisation. See page 16:

http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html

To read the start time, it says (I annotated a bit):

__asm volatile (
    "cpuid\n\t"             // writes e[abcd]x
    "rdtsc\n\t"             // writes edx, eax
    "mov %%edx, %0\n\t" 
    "mov %%eax, %1\n\t"
    //
    :"=r" (cycles_high), "=r" (cycles_low)  // outputs
    :                                       // inputs
    :"%rax", "%rbx", "%rcx", "%rdx");       // clobber

I'm wondering why scratch registers are used to take the values of edx and eax. Why not remove the movs and read the TSR value right out of edx and eax? Like this:

__asm volatile(                                                             
    "cpuid\n\t"
    "rdtsc\n\t"
    //
    : "=d" (cycles_high), "=a" (cycles_low) // outputs
    :                                       // inputs
    : "%rbx", "%rcx");                      // clobber

By doing this, you save two registers, reducing the likelihood of the C compiler needing to spill.

Am I right? Or are those MOVs somehow strategic?

(I agree that you do need scratch registers to read the stop time, as in that scenario the order of the instructions is reversed: you have rdtscp, ..., cpuid. The cpuid instruction destroys the result of rdtscp).
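
For illustration, a sketch of that stop-time read might look like the following. It assumes x86-64, GCC or Clang, and a CPU that supports rdtscp; the name `tsc_stop` is made up for this example and does not come from the Intel document.

```c
#include <stdint.h>

/* Hypothetical helper: read the TSC at the end of a timed region.
   Assumes x86-64, GNU C inline asm, and a CPU with rdtscp. */
static inline uint64_t tsc_stop(void)
{
    uint32_t hi, lo;
    __asm__ volatile(
        "rdtscp\n\t"          /* writes edx:eax, and also ecx */
        "mov %%edx, %0\n\t"   /* copy the result out before cpuid overwrites it */
        "mov %%eax, %1\n\t"
        "cpuid\n\t"           /* writes e[abcd]x */
        : "=r"(hi), "=r"(lo)                  /* outputs: any free registers */
        :                                     /* no inputs */
        : "rax", "rbx", "rcx", "rdx");        /* clobbered by rdtscp and cpuid */
    return ((uint64_t)hi << 32) | lo;
}
```

Here the two movs are doing real work: the outputs cannot be eax or edx, because cpuid runs after the value has been read.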

Thanks

I'm not an expert on GCC inline syntax, but I'd guess that in the second version GCC will generate the *movs* by itself, so it is a matter of readability. Side note: Shouldn't `rdtsc` be *surrounded* by serializing instructions, not just before? I usually use `lfence` in favor of `CPUID` since it is locally serializing and doesn't clobber any register. — Margaret Bloom, Aug 17 '16 at 11:07
I would expect a semi-clever compiler to re-use the output register for the local variable, but I might be wrong. — Edd Barrett, Aug 17 '16 at 11:09
Regarding the `lfence`, do you have a source which demonstrates? — Edd Barrett, Aug 17 '16 at 11:10
True, indeed. Regarding `lfence`, what demonstration are you looking for? `lfence` can be found in the *Intel Manual 2*, where it is said to be locally serializing. — Margaret Bloom, Aug 17 '16 at 11:16
I was wondering if you had seen `lfence` in the context of benchmarking with the TSR. I wonder if the `cpuid` calls serve the same purpose... — Edd Barrett, Aug 17 '16 at 11:21
There is something interesting [here](http://akaros.cs.berkeley.edu/lxr/akaros/kern/arch/x86/rdtsc_test.c). — Margaret Bloom, Aug 17 '16 at 12:02
Don't use `rdtsc` to measure CPU time (since context switches can occur at any moment). Use OS specific functions (on Linux, see [time(7)](http://man7.org/linux/man-pages/man7/time.7.html) then use [clock_gettime(2)](http://man7.org/linux/man-pages/man2/clock_gettime.2.html)...) — Basile Starynkevitch, Aug 18 '18 at 15:28

Peter Cordes · Accepted Answer · 2018-08-18T15:04:33.707

You're correct, the example is clunky. Usually if mov is the first or last instruction in an inline-asm statement, you're doing it wrong, and should have used a constraint to tell the compiler where you want the input, or where the output is.
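
As a minimal sketch of that advice applied to the start-time read from the question, assuming x86-64 and GCC or Clang: the `"+a"` constraint passes leaf 0 to cpuid in EAX and also receives the low half of the TSC, while `"=d"` receives the high half, so no mov is needed. The function name `tsc_start` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: serialize with cpuid, then read the TSC.
   Assumes x86-64 and a GNU C compatible compiler (GCC or Clang). */
static inline uint64_t tsc_start(void)
{
    uint32_t lo = 0;  /* in: cpuid leaf 0 in eax; out: low half of the TSC */
    uint32_t hi;      /* out: high half of the TSC, left in edx by rdtsc */
    __asm__ volatile(
        "cpuid\n\t"   /* reads eax; writes eax, ebx, ecx, edx */
        "rdtsc"       /* writes edx:eax */
        : "+a"(lo), "=d"(hi)   /* results are used where rdtsc leaves them */
        :                      /* no separate inputs */
        : "rbx", "rcx");       /* written by cpuid, not otherwise listed */
    return ((uint64_t)hi << 32) | lo;
}
```

The compiler now knows the values live in eax and edx and can combine or move them as it sees fit, without needing two extra scratch registers.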

See my GNU C inline asm guides / links collection, and other links in the inline-assembly tag wiki. (The x86 tag wiki is full of good stuff for asm in general, too.)

Or for rdtsc specifically, see Get CPU cycle count? for the __rdtsc() intrinsic, and good inline asm in @Mysticial's answer.
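
A short sketch of the intrinsic route, assuming GCC or Clang on x86-64, where `<x86intrin.h>` declares `__rdtsc()` (MSVC declares it in `<intrin.h>` instead). The helper names `tsc_now` and `time_loop` are made up for illustration, and no serialization is done here.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang header that declares __rdtsc() */

/* Hypothetical wrapper: return the current TSC value without any fencing. */
static inline uint64_t tsc_now(void)
{
    return __rdtsc();
}

/* Hypothetical example: TSC ticks elapsed across a small loop. */
static uint64_t time_loop(unsigned iterations)
{
    volatile unsigned sink = 0;  /* volatile so the loop is not optimized away */
    uint64_t start = tsc_now();
    for (unsigned i = 0; i < iterations; i++)
        sink += i;
    return tsc_now() - start;
}
```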

> it measures in a monotonic unit of time which is immune to the clock speed changing.

Yes, on CPUs made within the last 10 years or so.

For profiling, it's often more useful to have times in core clock cycles, not wall-clock time, so your microbenchmark results don't depend on power-saving / turbo. Performance counters can do this and much more.

Still, if real time is what you want, rdtsc is the lowest-overhead way to get it.

And re: discussion in comments: yes, cpuid is there to serialize, making sure that rdtsc and following instructions can't begin executing until after CPUID. You could put another CPUID after RDTSC, but that would increase measurement overhead, and I think give near-zero gain in accuracy / precision.

LFENCE is a cheaper alternative that's useful with RDTSC. The instruction ref manual entry documents the fact that it doesn't let later instructions start executing until it and previous instructions have retired (from the ROB/RS in the out-of-order part of the core). See Are loads and stores the only instructions that gets reordered?, and for a specific example of using it, see clflush to invalidate cache line via C function. Unlike true serializing instructions like cpuid, it doesn't flush the store buffer.

(On recent AMD CPUs without Spectre mitigation enabled, lfence is not even partially serializing, and runs at 4 per clock according to Agner Fog's testing. Is LFENCE serializing on AMD processors?)
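
A sketch of the LFENCE variant using intrinsics, assuming GCC or Clang on x86-64 and a CPU where lfence has the ordering behavior described above. The name `tsc_fenced` is hypothetical.

```c
#include <stdint.h>
#include <x86intrin.h>  /* declares __rdtsc() and _mm_lfence() on GCC/Clang */

/* Hypothetical helper: read the TSC with an lfence on each side. */
static inline uint64_t tsc_fenced(void)
{
    _mm_lfence();            /* rdtsc can't start until earlier instructions retire */
    uint64_t t = __rdtsc();
    _mm_lfence();            /* later instructions can't start before rdtsc is done */
    return t;
}
```

Unlike the cpuid versions, this clobbers no registers beyond the ones rdtsc writes.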

Margaret Bloom dug up this useful link, which also confirms that LFENCE serializes RDTSC according to Intel's SDM, and has some other stuff about how to do serialization around RDTSC.

Thanks for your answer! Actually, we did not want time at all! We wanted a measure of work independent from time, so that frequency changes cannot skew the result. I've found a few performance counters which may help, now I will be looking into a lightweight way to access them without using the sledgehammer that is perf. Hopefully you can program the counters from user-space asm code. — Edd Barrett, Aug 18 '16 at 10:37
Or maybe you can't: http://stackoverflow.com/questions/39021662/how-to-configure-and-sample-intel-performance-counters-in-process — Edd Barrett, Aug 18 '16 at 15:12
You can program counters from user space, but you probably want to pin your threads to cores because PMCs aren't saved/restored on context switches. See http://agner.org/optimize/ for an existing kernel module that gives you PMC access, and also http://stackoverflow.com/questions/38848914/pmu-for-multi-threaded-environment/38984414#38984414 for some discussion of using them. — Peter Cordes, Aug 18 '16 at 15:19
(clarification to previous comment: you can program PMU counters from user-space only via system calls, not directly. Privileged instructions are required. Once programmed, `rdpmc` can work in user-space, if the kernel allows it, to read those counters even more cheaply than `rdtsc`) — Peter Cordes, Feb 20 '23 at 07:56

score 3 · Answer 2 · answered Aug 17 '16 at 17:31

No, there doesn't seem to be a good reason for the redundant MOV instructions in the inline assembly. The paper first introduces inline assembly with the following statement:

asm volatile (
    "RDTSC\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t": "=r" (cycles_high1), "=r" (cycles_low1));

This has the obvious problem that it doesn't tell the compiler that EAX and EDX have been modified by the RDTSC instruction. The paper points out this mistake and corrects it using clobbers:

asm volatile ("RDTSC\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
    "%eax", "%edx")

No other justification is given for writing it this way other than correcting the mistake in the previous example. It appears that the paper's author is simply unaware that it could be written more simply as:

asm volatile ("RDTSC\n\t"
    : "=d" (cycles_high), "=a" (cycles_low));

Similarly the author is apparently unaware that there's a simpler version of the improved asm statement that uses RDTSC in combination with CPUID, as you demonstrate in your post.

Note that the author of the paper repeatedly misuses the term "IA64" to refer to the 64-bit x86 instruction set and architecture (variously referred to as x86_64, AMD64 and Intel 64). The IA-64 architecture is actually something completely different: it's the one used by Intel's Itanium CPUs. It has no EAX or RAX registers, and no RDTSC instruction.

While it doesn't really matter that the author's inline assembly is more complex than it needs to be, this fact combined with the misuse of IA64, something that should've been caught by Intel's editors, makes me doubt the credibility of this paper.

Thanks for your answer. If I could mark two answers correct, I would! They do use `cpuid` in the document, that's where I got it from, see page 16. — Edd Barrett, Aug 18 '16 at 10:35
@EddBarrett Yah, I know, I'm saying that the author also doesn't know that the CPUID version of the asm statement in the paper can also be simplified in the same way. — Ross Ridge, Aug 18 '16 at 15:28
Are you sure they misused the term IA64? Or did they actually mean Itanium? — parisa, Jul 29 '18 at 21:35
@parisa I'm sure. An Itanium CPU has no EAX or RAX registers, and no RDTSC instruction. — Ross Ridge, Jul 30 '18 at 00:11
