4

I have a code like that:

const uint64_t tsc = __rdtsc();
const __m128 res = computeSomethingExpensive();
const uint64_t elapsed = __rdtsc() - tsc;
printf( "%" PRIu64 " cycles", elapsed );

In release builds, this prints garbage like “38 cycles” because VC++ compiler reordered my code:

    const uint64_t tsc = __rdtsc();
00007FFF3D398D00  rdtsc  
00007FFF3D398D02  shl         rdx,20h  
00007FFF3D398D06  or          rax,rdx  
00007FFF3D398D09  mov         r9,rax  
    const uint64_t elapsed = __rdtsc() - tsc;
00007FFF3D398D0C  rdtsc  
00007FFF3D398D0E  shl         rdx,20h  
00007FFF3D398D12  or          rax,rdx  
00007FFF3D398D15  mov         rbx,rax  
00007FFF3D398D18  sub         rbx,r9  
    const __m128 res = …
00007FFF3D398D1B  lea         rdx,[rcx+98h]  
00007FFF3D398D22  mov         rcx,r10  
00007FFF3D398D25  call        computeSomethingExpensive (07FFF3D393E50h)  

What’s the best way to fix?

P.S. I’m aware rdtsc doesn’t count cycles, it measures time based on CPU’s base frequency. I’m OK with that, I still want to measure that number.

Update: godbolt link

Soonts
  • 20,079
  • 9
  • 57
  • 130
  • 3
    You can mark the variables `volatile`. It should work. – ALX23z Jan 25 '22 at 10:02
  • 1
    @ALX23z Indeed, but volatile also evicted them from registers to stack memory, so each `__rdtsc` compiles into these instructions: rdtsc / shl rdx,20h / or rax, rdx / mov qword ptr [rsp+50h], rax – Soonts Jan 25 '22 at 10:09
  • 1
    Compiler bug perhaps? I don't think `__rdtsc` was reordered in older MSVC versions. Certainly it wasn't reordered if you add `_mm_sfence();` around the call, but looks like even that doesn't help. – rustyx Jan 25 '22 at 10:28
  • @rustyx Adding `_mm_sfence()` after the second `__rdtsc()` call helped, at least in my project. I wonder is that the best way to go? – Soonts Jan 25 '22 at 10:41
  • I can't see why a fence there _should_ work, so I'd assume it's a brittle fix and may break with the next optimizer update. What happens with an `std::atomic` store or fence between the two rdtsc intrinsics? You can see what the compiler does for different memory orders... – Useless Jan 25 '22 at 10:47
  • For me (VC++ 19.30.30709) it doesn't help in /O2. https://godbolt.org/z/dffojYaoz – rustyx Jan 25 '22 at 10:47
  • @rustyx: You're choosing `_mm_sfence()` as a cheap instruction (on Intel, not AMD where it's a full barrier) that has no interaction with `rdtsc`? Normally you'd use `lfence` to prevent runtime reordering (out-of-order exec) of RDTSC with the timed region, although that works by draining the ROB and adds some extra fixed cost to the timed region. So if `_mm_lfence()` works, it's actively helpful in some ways to the generated code, instead of just extra overhead like `sfence` or a store would be. – Peter Cordes Jan 25 '22 at 10:58
  • @PeterCordes I'm mostly working with Xeon so yes I thought `sfence` would be cheaper, but good point about AMD and `lfence`. Actually I think my previous example is wrong. Function calls with no dependency chain to memory may be exempted from memory fences, this makes sense. We might need a better MCVE to see what the function under test does exactly. – rustyx Jan 25 '22 at 11:23
  • @rustyx My function takes a million cycles to run, and reads several megabytes of memory. It doesn’t make any system calls, it only crunches numbers with AVX, on the single thread. – Soonts Jan 25 '22 at 11:27
  • @rustyx: A function that's not visible to the optimizer (i.e. not inlinable) should be a black box that it can't make any assumptions about in terms of reordering it wrt. memory barriers. e.g. if it just sees a prototype in this compilation unit. – Peter Cordes Jan 25 '22 at 11:56
  • @PeterCordes I think modern optimizers are looking into all functions, otherwise LTCG wouldn’t work. That’s not specific to VC++, other compilers are doing pretty much the same thing, they just call it LTO. – Soonts Jan 25 '22 at 12:22
  • 1
    @Soonts: Obviously you have to disable link-time optimization for separate compilation units to block optimization. e.g. for GCC, simply don't use `-flto`. – Peter Cordes Jan 25 '22 at 12:33
  • I could not reproduce this issue with any function that involves memory access. Please provide a demo, preferably on godbolt. – rustyx Jan 25 '22 at 13:06
  • @rustyx See the update. BTW, my production function doesn’t store anything into memory, it only loads, just like that example. – Soonts Jan 25 '22 at 13:16
  • Loading `const`s from memory doesn't depend on the order of any other memory access. So it's no wonder memory fences have no effect. I think `volatile` is your best option at this point. Just discount a nanosecond for the extra access to memory and be done with it. – rustyx Jan 25 '22 at 13:26
  • @rustyx I wonder what’s cheaper, reading/writing two integers from/to the stack with `volatile`, or calling a noinline wrapper around `__rdtsc`? https://godbolt.org/z/x7f5sssWY – Soonts Jan 25 '22 at 13:31
  • 2
    [`_ReadBarrier`](https://learn.microsoft.com/en-us/cpp/intrinsics/readbarrier?view=msvc-170) works but is allegedly deprecated... but it seems to have the least impact. I put it before the second `rdtsc` read. [on godbolt](https://godbolt.org/z/q5qc877oW) – Mgetz Jan 25 '22 at 13:32
  • @Mgetz It seems `atomic_uint64_t` evicted to the stack just like `volatile`. The `_ReadBarrier` is the best workaround I saw so far, maybe I should just ignore the documentation which says it's deprecated :-) – Soonts Jan 25 '22 at 13:44
  • 1
    @Soonts yeah I did a diff on godbolt and removed that comment for good reason. It treats the atomics as volatile (despite having different guarantees in the standard). My next step was going to be using different load and store options on the atomics to basically prevent things from moving. – Mgetz Jan 25 '22 at 13:52
  • Checked and [and no improvement](https://godbolt.org/z/1odWcWh6b) – Mgetz Jan 25 '22 at 13:58
  • 1
    @Mgetz Thanks for your help. I have confirmed the `_ReadBarrier` intrinsic does the right thing in my production code built with VS2022. – Soonts Jan 25 '22 at 14:17

1 Answers1

0

Adding a fake store

static bool save = false;
if (save)
{
   static float res1[4];
   _mm_store_ps(res1, res);
}

before the second __rdtsc seem to be enough to fool the compiler.

(Not adding a real store to avoid contention if this function is called in multiple threads, though could use TLS to avoid that)

Alex Guteniev
  • 12,039
  • 2
  • 34
  • 79
  • That doesn't seem better than a real store to the stack, by using a `volatile` non-static local var to save the first `rdtsc` result. Still just a single uop, vs. probably two to check a global. Of course, if nothing ever modifies `static bool save`, constant propagation will result in no code at all, so maybe no effect? You might need a non-static global var, or `volatile static bool save = false`. But then test and branch on it is probably at least 2 uops; micro+macro fusion of a cmp/jcc only happens with `reg, mem` or `reg, reg` so the compiler would need a zeroed reg to make it 1 uop – Peter Cordes Jan 25 '22 at 12:34
  • 1
    @Peter, `volatile` to store the first `rdtsc` result does not work, the computation still moves past the second: https://godbolt.org/z/GqMdGWv1j – Alex Guteniev Jan 25 '22 at 15:20
  • @AlexGuteniev: returning an unused compile-time constant is a weird test, though. If I `return _mm_set1_ps(global_var);`, I get a `vbroadcastss` inside the timed region. https://godbolt.org/z/GYsjKdn8n . Oh, but Rustyx already commented under the question that this is specific to functions that don't touch memory at all, so I guess just return a constant, or something dependent on args. So one fix would be to make the function under test store a scalar or 128-bit vector (to avoid stack over-alignment overhead) to a volatile local or global. – Peter Cordes Jan 25 '22 at 20:16