x86_64 - Why is timing a program with rdtsc/rdtscp giving unreasonably large numbers?

Question

I'm trying to time a subroutine using rdtscp. This is my procedure:

; Setting up time
rdtscp                      ; Getting time
push rax                    ; Saving timestamp

; for(r9=0; r9<LOOP_SIZE; r9++)
mov r9, 0
lup0:
call subr
inc r9
cmp r9, LOOP_SIZE
jnz lup0

; Calculating time taken
pop rbx                     ; Loading old time
rdtscp                      ; Getting time
sub rax, rbx                ; Calculating difference

If LOOP_SIZE is small enough, I get consistent and expected results. However, when I make it big enough (around 10^9), the result jumps from about 10^9 to about 1.8 × 10^19.

; Result with "LOOP_SIZE equ 100000000"
971597237
; Result with "LOOP_SIZE equ 1000000000"
18446744072281657066

The method that I'm using to display the numbers displays them as unsigned, so I imagine that the large number displayed is actually a negative number and an overflow happened. However, 971597237 is not even close to the 64-bit integer limit, so, assuming that the problem is an overflow, why is it happening?

`rdtsc` annoyingly puts its result in EDX:EAX even in 64-bit mode. https://www.felixcloutier.com/x86/rdtsc. You're only saving / using the low 32 bits of the TSC, and getting a 32-bit unsigned difference, sign-extended to 64-bit because you're computing it with `sub rax, rbx` on the zero-extended 32-bit values instead of `sub eax, ebx`. — Peter Cordes, Nov 19 '20 at 03:14
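
The arithmetic described in that comment can be reproduced with a minimal C sketch. The function names and the starting value 3000000000 are hypothetical, chosen only so that the result matches the number in the question; the actual low halves of the two timestamps are not known.

```c
#include <stdint.h>

/* Models the question's code: each timestamp is only the low 32 bits of
 * the TSC (EAX), zero-extended to 64 bits, and the difference is taken
 * with 64-bit arithmetic, like `sub rax, rbx`. If the low half wrapped
 * in between, the result is a "negative" 64-bit value. */
uint64_t diff_64bit_on_low_halves(uint32_t start_lo, uint32_t end_lo)
{
    return (uint64_t)end_lo - (uint64_t)start_lo;
}

/* Models `sub eax, ebx`: the subtraction wraps modulo 2^32, so it gives
 * the elapsed ticks modulo 2^32 rather than a huge 64-bit number. */
uint32_t diff_32bit(uint32_t start_lo, uint32_t end_lo)
{
    return (uint32_t)(end_lo - start_lo);
}
```

With start_lo = 3000000000 and end_lo = 1572105450 (assumed values), the first function returns 18446744072281657066, the number from the question, while the second returns 2867072746. Even that second value is only the elapsed tick count modulo 2^32, so it can be trusted only for intervals shorter than 2^32 ticks.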

Luiz Martins · Accepted Answer · 2021-03-10T12:47:00.447

7

The problem is that, as per the documentation, rdtscp does not return its value in rax but in edx:eax (the high 32 bits in edx and the low 32 bits in eax), even in 64-bit mode.
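
As an illustration, here is how the two halves form the 64-bit timestamp, sketched in C. The function name combine_tsc is made up for this example; it only models the arithmetic and does not read the counter itself.

```c
#include <stdint.h>

/* Builds the 64-bit timestamp from the two 32-bit halves that
 * rdtsc/rdtscp return in EDX (high) and EAX (low).
 * In 64-bit mode the instruction clears the upper 32 bits of RAX and
 * RDX, so after shifting the high half left by 32 the two values share
 * no bits, and OR, ADD and LEA all give the same result. */
uint64_t combine_tsc(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}
```

For example, combine_tsc(2, 2867072746) gives 11457007338, that is 2 × 2^32 + 2867072746.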

So, if you want the full 64-bit value in rax, you have to shift the high bits in rdx into position and merge them into rax:

; Setting up time
rdtscp                      ; Getting time
shl rdx, 32                 ; Shifting rdx to the correct bit position
add rax, rdx                ; Adding both to make timestamp
push rax                    ; Saving timestamp

; [...stuff...]

; Calculating time taken
rdtscp                      ; Getting time
pop rbx                     ; Loading old time (below rdtscp)
shl rdx, 32                 ; Shifting rdx to the correct bit position
add rax, rdx                ; Adding both to make timestamp
sub rax, rbx                ; Calculating difference

Edit: Moved pop rbx one line down, below rdtscp. As pointed out by Peter, rdtscp clobbers rax, rdx and rcx. With rbx, as in the question, popping before rdtscp is not a problem, but if you popped into rcx before rdtscp instead, the saved timestamp would be overwritten. Popping only after rdtscp avoids that, and it also keeps the pop out of the timed region.

Also, you can avoid the two stack accesses (the push and the pop) by saving the old timestamp in a register that your subroutine doesn't modify:

; Setting up time
rdtscp                      ; Getting time
shl rdx, 32                 ; Shifting rdx to the correct bit position
lea r12, [rdx + rax]        ; Adding both to make timestamp, and saving it

; [...stuff (that doesn't use r12)...]

; Calculating time taken
rdtscp                      ; Getting time
shl rdx, 32                 ; Shifting rdx to the correct bit position
add rax, rdx                ; Adding both to make timestamp
sub rax, r12                ; Calculating difference

edited Mar 10 '21 at 12:47

answered Nov 19 '20 at 03:18

Luiz Martins


Yes, that's correct. Normally you'd use a call-clobbered register like rsi or r8 that you don't need to save/restore around your function, instead of RBX. (`rdtscp` clobbers RCX). Or instead of the stack, you can merge halves into another register with `shl rdx,32` / `lea r8, [rdx + rax]`, choosing a register your timed region doesn't modify. – Peter Cordes Nov 19 '20 at 03:21

There's no need to guess about what registers `rdtscp` touches. It writes RDX, RAX, and RCX. https://www.felixcloutier.com/x86/rdtscp. But yeah, since you don't need (or want) the `pop` in the timed region, you can just pop into RCX, if you insist on using the stack at all instead of a register to hold the old time. Then you don't need to touch any registers that RDTSCP didn't already destroy. – Peter Cordes Nov 19 '20 at 04:05
*but if you decided to pop rcx instead, then it'd probably get overwritten by rdtscp* - No, not if you `pop` *after* `rdtscp` like you're doing now. My comment was based on the original version, which popped inside the timed region, and unfortunately I didn't think of just popping after. – Peter Cordes Nov 19 '20 at 13:03

`add rax, rdx` / `mov r8, rax` can be replaced by `lea r8, [rdx + rax]` like I suggested originally, which is cheaper on all CPUs in essentially all ways that might matter. Also, if what you're timing is a "subroutine", if it follows the standard calling convention you might want to pick a call-preserved register like `r12`. – Peter Cordes Nov 19 '20 at 13:05
@PeterCordes Yeah, the edit is based on the version in the question, which has `pop rbx` before `rdtscp`. And regarding the `lea` comment: Updated :^) – Luiz Martins Nov 19 '20 at 13:36

Well the way you state it (right after and in the same paragraph as "Edit: Moved pop rbx one line down") results in what looks like a nonsensical claim about further improvements to your proposed working answer. – Peter Cordes Nov 19 '20 at 13:41
