Update: Minimal example demonstrating the problem in Clang 7.0:
https://wandbox.org/permlink/G5NFe8ooSKg29ZuS
https://godbolt.org/z/PEWiRk
I'm seeing the total time for 256 iterations of a method vary from 0μs to 500-900μs (Visual Studio 2017):
    void* SomeMethod()
    {
        void *result = _ptr; // _ptr is from malloc
        // Increment original pointer
        _ptr = static_cast<uint8_t*>(_ptr) + 32776; // (1)
        // Set the back pointer
        *static_cast<ThisClass**>(result) = this; // (2)
        return result;
    }
If I comment out either line (1) or line (2), the measured time is 0μs. With both lines included, each call takes between 2μs and 4μs.
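A minimal sketch of how the measurement might be reproduced follows. Only `_ptr` and the body of `SomeMethod` come from the code above; the constructor, destructor, `_base`, `kStride`, the `TimeCalls` helper and the `volatile` sink are hypothetical names added for illustration, and the real class may be laid out differently.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for the class in the question. Only _ptr and the
// body of SomeMethod are taken from the original; the rest is assumed.
class ThisClass
{
public:
    static constexpr std::size_t kStride = 32776; // increment used on line (1)

    explicit ThisClass(std::size_t blocks)
        : _base(std::malloc(blocks * kStride)), _ptr(_base)
    {
        if (_base == nullptr)
            throw std::bad_alloc();
    }
    ~ThisClass() { std::free(_base); }
    ThisClass(const ThisClass&) = delete;
    ThisClass& operator=(const ThisClass&) = delete;

    void* SomeMethod()
    {
        void* result = _ptr;
        _ptr = static_cast<std::uint8_t*>(_ptr) + kStride; // (1)
        *static_cast<ThisClass**>(result) = this;           // (2)
        return result;
    }

private:
    void* _base; // start of the malloc'd buffer, kept so it can be freed
    void* _ptr;  // next block to hand out
};

// Times `iterations` calls of SomeMethod and returns the total in microseconds.
inline long long TimeCalls(std::size_t iterations)
{
    ThisClass object(iterations);
    void* volatile sink = nullptr; // stops the calls being optimised away
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        sink = object.SomeMethod();
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `TimeCalls(256)` with line (1) or line (2) commented out, and again with both present, should show whether the difference described above appears on a given compiler and machine.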
I'm not convinced that I'm breaking strict aliasing rules, and in Compiler Explorer I can see that setting the back pointer (line (2)) generates only one instruction:
    mov QWORD PTR [rax], rcx
This makes me doubt that strict aliasing is stopping the compiler from optimising, since the only effect appears to be one extra instruction for that one line of code.
For reference, incrementing the original pointer (line (1)) generates two instructions:
    lea rdx, QWORD PTR [rax+32776]
    mov QWORD PTR [rcx], rdx
For completeness, here is the full assembly output:
    mov rax, QWORD PTR [rcx]
    lea rdx, QWORD PTR [rax+32776]
    mov QWORD PTR [rcx], rdx
    mov QWORD PTR [rax], rcx
    ret 0
What could be the cause of the performance difference? My assumption right now is that the code plays poorly with the CPU's cache, but I can't work out why the inclusion of a single move instruction would cause that.