
Update: Minimal example demonstrating the problem in Clang 7.0 - https://wandbox.org/permlink/G5NFe8ooSKg29ZuS
https://godbolt.org/z/PEWiRk

I'm seeing the execution time of a method vary between 0μs and 500-900μs over 256 iterations (Visual Studio 2017):

void* SomeMethod()
{
    void *result = _ptr; // _ptr is from malloc

    // Increment original pointer
    _ptr = static_cast<uint8_t*>(_ptr) + 32776;    // (1)

    // Set the back pointer
    *static_cast<ThisClass**>(result) = this;      // (2)

    return result;
}
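
For context, `_ptr` lives in the class that owns the allocation. Here is a rough sketch of that surrounding class (the `_base` member, the allocation size, and the constructor are guesses for illustration, not the exact code from the links above):

#include <cstdint>
#include <cstdlib>

class ThisClass
{
public:
    ThisClass()
        : _base(std::malloc(32776 * 257)),   // enough room for 256 calls (size is illustrative)
          _ptr(_base)
    {
    }

    ~ThisClass() { std::free(_base); }

    void* SomeMethod()   // as shown above
    {
        void* result = _ptr;
        _ptr = static_cast<uint8_t*>(_ptr) + 32776;    // (1)
        *static_cast<ThisClass**>(result) = this;      // (2)
        return result;
    }

private:
    void* _base;   // original allocation, kept only so it can be freed
    void* _ptr;    // bump pointer handed out by SomeMethod
};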

If I comment out either line (1) or line (2), the method times at 0μs. Including both lines results in a timing of between 2μs and 4μs per function call.

I'm not convinced that I'm breaking strict aliasing rules, and when inspecting the output in Compiler Explorer I can see that setting the back pointer (line (2)) generates only one instruction:

mov QWORD PTR [rax], rcx

Which leads me to wonder whether it can really be strict aliasing causing the compiler not to optimise, when the only effect appears to be one extra instruction for that one line of code.
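
(As an aside, line (2) can also be written with `memcpy`, which sidesteps the type-punning question entirely; compilers typically lower a fixed-size `memcpy` of a pointer to the same single mov. This variant is only a sketch for comparison, not what the timings above were measured with:)

#include <cstdint>
#include <cstring>

class ThisClassMemcpy
{
public:
    void* SomeMethod()
    {
        void* result = _ptr;
        _ptr = static_cast<uint8_t*>(_ptr) + 32776;

        // Write the back pointer bytewise instead of through a cast.
        ThisClassMemcpy* self = this;
        std::memcpy(result, &self, sizeof(self));

        return result;
    }

private:
    void* _ptr = nullptr;   // would come from malloc, as above
};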

For reference, incrementing the original pointer (line (1)) generates two instructions:

lea     rdx, QWORD PTR [rax+32776]
mov     QWORD PTR [rcx], rdx

And for completeness, here is the complete assembly output:

mov     rax, QWORD PTR [rcx]
lea     rdx, QWORD PTR [rax+32776]
mov     QWORD PTR [rcx], rdx
mov     QWORD PTR [rax], rcx
ret     0

What could be the cause of the performance difference? My assumption right now is that the code plays poorly with the CPU's cache, but I can't work out why the inclusion of one mov instruction would cause that.

Mark Ingram
  • 1ms for 2 instructions only? That doesn't sound right :/ – Matthieu Brucher Feb 01 '19 at 18:17
  • How sure are you of your timings? – YSC Feb 01 '19 at 18:19
  • [clang really, really doesn't like your method](https://godbolt.org/z/4XcdIN). [Gcc agrees](https://godbolt.org/z/__wbfg). – YSC Feb 01 '19 at 18:22
  • I'm not sure the c++ tag is appropriate for this question. The code shown has undefined behavior which OP acknowledges and for which the output assembly is provided. From that point on, the question becomes strictly about the assembly, regardless of how it was obtained. If the question actually *is* about the code, then the answer is simply "because UB". – François Andrieux Feb 01 '19 at 18:40
  • @FrançoisAndrieux The question is about code compiled with a C++ compiler, Visual C++ 2017, so the C++ tag is perfectly acceptable. There's no requirement that questions have to be answerable within the scope of the C++ standard. – Ross Ridge Feb 01 '19 at 19:44
  • @RossRidge Well, if I was sure it was off topic I would have removed it myself. Though from the upvote on my comment it seems I'm not the only one with this doubt. My point is if you removed every bit of c++ from the question it essentially would not change in any significant way. Edit: Secondly, c++ expertise is not helpful for answering this question. I don't see why this question should attract attention from the c++ community. – François Andrieux Feb 01 '19 at 19:46
  • @MatthieuBrucher - apologies, I forgot to mention the timings were based around 256 iterations. I've edited to reflect that. – Mark Ingram Feb 01 '19 at 19:56
  • @FrançoisAndrieux have you even read the description of the C++ tag? "Use this tag for questions about code (to be) compiled with a C++ compiler." My question is about code compiled with a C++ compiler, so I think it fits the C++ tag incredibly well. – Mark Ingram Feb 01 '19 at 19:58
  • @YSC Thanks. It looks like both Clang and GCC still don't like it even when disabling strict aliasing with `-fno-strict-aliasing`. – Mark Ingram Feb 01 '19 at 20:08
  • @FrançoisAndrieux No, fundamentally the question is about the unexpectedly poor performance of C++ code. A good answer would include suggestions on how the C++ code can be changed to fix or work around the problem. (That is, if it were answerable; in its current state it's missing a lot of context. It's not even clear how the code is actually violating the strict aliasing rule or otherwise invoking undefined behaviour.) – Ross Ridge Feb 01 '19 at 20:10
  • @YSC That's because you set `_ptr` to `nullptr` rather than obtaining it from `malloc`. – Ross Ridge Feb 01 '19 at 20:15
  • It would be nice if you can provide a Minimal, Complete, and Verifiable example, including the outer loop, the code that takes measurements, and compiler options. – Hadi Brais Feb 01 '19 at 20:24
  • @YSC if you call `malloc` and then try to work around the compiler optimising everything away, then it looks OK - https://godbolt.org/z/a0XOHl – Mark Ingram Feb 01 '19 at 20:31
  • *a timing of between 2μs and 4μs per function call.* That's pretty implausible. Even if the load and both stores miss in all levels of cache, that's only ~100ns + ~60 cycle on a big Xeon (https://www.7-cpu.com/cpu/Haswell.html apparently measured that for a dual-socket E5-2699 v3 18-core per socket.) Multiple microseconds is not really plausible for those 4 instructions unless there's a lot of other stuff going on. Outside of memory-order mis-speculation, IDK how another load could delay this one that much. Intel CPUs load speculatively and then check if they obeyed the x86 memory model. – Peter Cordes Feb 02 '19 at 08:34
  • @HadiBrais, managed to get a small sample together that demonstrates the problem (in Clang 7.0, so not specific to MSVC). Link added to OP, but here for convenience - https://wandbox.org/permlink/3RxgbyzeXomQnBVT – Mark Ingram Feb 02 '19 at 10:51
  • @PeterCordes I've updated the question with a reproducible example - https://wandbox.org/permlink/3RxgbyzeXomQnBVT – Mark Ingram Feb 02 '19 at 10:52
  • You have the start/stop timer inside the loop!!! If it makes even one system call, it will completely swamp the signal from the actual work you're doing (and depending on Spectre / Meltdown mitigation, flush TLBs and clear branch prediction). On my x86-64 Linux system it doesn't (or the `clock_gettime` system call stays purely in user-space, using code and RDTSC scale factors exported by the kernel in the VDSO page.) But it still probably swamps whatever "real" work you're doing in the loop. – Peter Cordes Feb 02 '19 at 11:19
  • 256 iterations is a tiny amount, probably not enough for the CPU to warm up to max turbo. If I run `./a.out` in a shell loop, I get 0.921875us to 1.0 us. Otherwise with the CPU cold, I typically get 5.2 us. (i7-6700k, max clock set to 3.9GHz, idle at 800MHz). – Peter Cordes Feb 02 '19 at 11:20
  • If the execution time ranges between 500-900μs, then it means that the standard deviation is very large. This makes sense because you're essentially measuring the amount of time it takes to execute `now()`. You should instead pull out the timing code outside the loop and increase the trip count to 1 billion. Then, since each iteration of the loop does the same exact work, we can divide the measurement by 1 billion to get the time per iteration. This would be much more accurate. You cannot measure something whose time is within the range of the deviation of `now()`. – Hadi Brais Feb 02 '19 at 16:38
  • @PeterCordes changing the timing location makes the problem even more pronounced - 45716us vs 5us. https://wandbox.org/permlink/G5NFe8ooSKg29ZuS – Mark Ingram Feb 03 '19 at 10:13
  • Compiling with MSVC probably won't help you find strict-aliasing violations. I've seen people write that MSVC doesn't implement strict-aliasing based optimizations at all, and that's why type-punning via pointer or reference casting is well-defined on MSVC (and recommended by MS). But yes, gcc doesn't warn about it either. https://godbolt.org/z/-0432N (I had to add `-fkeep-inline-functions` to see asm for `SomeMethod`: asm for `Test()` is useless because the default constructor sets `_ptr = nullptr`, so gcc's `Test()` emits a store to absolute address `0` and then a `ud2` (illegal insn).) – Peter Cordes Feb 04 '19 at 00:25

1 Answer


If you comment out either of those lines, you're either storing to the same address repeatedly (which probably gets optimized away in the loop) or not storing at all. It's no surprise that the time is unmeasurably short and rounds to 0 microseconds.

In the test code you linked, you're striding 32 KiB per store, on freshly allocated memory with no warmup. You're probably getting a soft page-fault and a copy-on-write every iteration. (The malloced memory was probably all lazily mapped to the same physical zero page.)

256 iterations is also totally insufficient to ramp up the CPU to normal / turbo clock speed, out of idle speed.
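
A sketch of the kind of warm-up I mean, run before starting the timer (the `buffer` / `size` names are placeholders, not from the linked test):

#include <cstddef>
#include <cstdint>

// Touch every page of the allocation so the soft page-faults / copy-on-write
// happen here rather than inside the timed region, then burn a little CPU so
// the core ramps up from its idle clock before measurement starts.
static void Warmup(void* buffer, std::size_t size)
{
    volatile std::uint8_t* p = static_cast<std::uint8_t*>(buffer);
    for (std::size_t i = 0; i < size; i += 4096)    // assuming 4 KiB pages
        p[i] = 0;                                   // write-fault each page

    volatile std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < 50000000; ++i)    // tens of milliseconds of spinning
        sink += i;
}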

On my i7-6700k Arch Linux desktop (idle 800MHz, normal clock speed 3.9GHz, governor / energy_performance_preference = balance_performance rather than the default balance_power, so it ramps up faster):

I compiled with gcc8.2.1 and ran the resulting executable in a loop with while ./a.out ;do :;done so the CPU would stay at high clock speed. The program prints times like 1.125us +- a bit. That sounds maybe about right for a page-fault + zeroing a page + updating the page tables and flushing TLBs.


Using Linux perf stat, I ran it 100 times and averaged the counts. (The secondary "rate" stats column has bogus units because of a perf bug that Arch hasn't updated to fix yet. Reading past the units, it's effectively reporting about 4.4GHz for cycles, which I think is also bogus, since Turbo is disabled on my CPU to keep the fans quiet.)

peter@volta:/tmp$ perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,dtlb_store_misses.miss_causes_a_walk,tlb_flush.dtlb_thread,dtlb_load_misses.miss_causes_a_walk -r100 ./a.out


 Performance counter stats for './a.out' (100 runs):

              1.15 msec task-clock                #    0.889 CPUs utilized            ( +-  0.33% )
                 0      context-switches          #   40.000 M/sec                    ( +- 49.24% )
                 0      cpu-migrations            #    0.000 K/sec                  
               191      page-faults               # 191250.000 M/sec                  ( +-  0.09% )
         4,343,915      cycles                    # 4343915.040 GHz                   ( +-  0.33% )  (82.06%)
           819,685      branches                  # 819685480.000 M/sec               ( +-  0.05% )
         4,581,597      instructions              #    1.05  insn per cycle           ( +-  0.05% )
         6,366,610      uops_issued.any           # 6366610010.000 M/sec              ( +-  0.05% )
         6,287,015      uops_executed.thread      # 6287015440.000 M/sec              ( +-  0.05% )
             1,271      dtlb_store_misses.miss_causes_a_walk # 1270910.000 M/sec                 ( +-  0.21% )
     <not counted>      tlb_flush.dtlb_thread                                         (0.00%)
     <not counted>      dtlb_load_misses.miss_causes_a_walk                                     (0.00%)

        0.00129289 +- 0.00000489 seconds time elapsed  ( +-  0.38% )

These counts include kernel mode, but that's 191 page faults for 256 loop iterations, so the vast majority of the time spent by this program is in the kernel.

And once we get back to user space, over 1,000 of the stores caused dTLB misses that also missed in the 2nd-level TLB, requiring a page walk. None of the loads did.

We could probably get much cleaner data by allocating a lot more memory so we could increase Count without segfaulting. Profiling with perf record shows that only about 20% of the program's total time is spent in main; the rest is in dynamic linker / startup overhead, partly from printing.
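
Something along these lines would give a cleaner per-store number (a sketch, not the code from the question's link; the sizes and the stand-in store are made up): allocate enough for many iterations, pre-fault the pages, and time the whole loop once.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main()
{
    constexpr std::size_t kStride = 32776;
    constexpr std::size_t kCount  = 32768;          // ~1 GiB of blocks, far more than 256
    std::uint8_t* base = static_cast<std::uint8_t*>(std::malloc(kStride * (kCount + 1)));
    if (!base) return 1;

    for (std::size_t i = 0; i < kCount; ++i)        // pre-fault every block we'll store to
        base[i * kStride] = 0;

    std::uint8_t* ptr = base;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < kCount; ++i)
    {
        *reinterpret_cast<void**>(ptr) = base;      // stand-in for the back-pointer store
        ptr += kStride;
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Printing one stored value keeps the compiler from discarding the stores.
    std::printf("%.3f ns per store (first slot holds %p)\n", ns / kCount, *reinterpret_cast<void**>(base));
    std::free(base);
}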

Peter Cordes
  • Thanks Peter. The 256 wasn't just for the benchmark, it was representative of the real-world code (of which the Wandbox is an attempt at a cut-down version). Your answer ties in perfectly though; the avg time for 256 iterations is 2us, whereas for 256,000 iterations (resetting the pointer every 256 calls) the average time is 0.006us. – Mark Ingram Feb 04 '19 at 21:11