https://godbolt.org/z/dK9v7En5v
For the following C++ code:
#include <stdint.h>
#include <cstdlib>

void Send(uint32_t);

void SendBuffer(uint32_t* __restrict__ buff, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        Send(buff[0]);
        Send(buff[1]);
        for (size_t j = 0; j < i; ++j) {
            Send(buff[j]);
        }
    }
}
we get the following assembly listing:
SendBuffer(unsigned int*, unsigned long):
        test    rsi, rsi
        je      .L15
        push    r13
        mov     r13, rsi
        push    r12
        mov     r12, rdi
        push    rbp
        xor     ebp, ebp
        push    rbx
        sub     rsp, 8
.L5:
        mov     edi, DWORD PTR [r12]
        call    Send(unsigned int)
        mov     edi, DWORD PTR [r12+4]
        call    Send(unsigned int)
        test    rbp, rbp
        je      .L3
        xor     ebx, ebx
.L4:
        mov     edi, DWORD PTR [r12+rbx*4]
        add     rbx, 1
        call    Send(unsigned int)
        cmp     rbx, rbp
        jne     .L4
.L3:
        add     rbp, 1
        cmp     r13, rbp
        jne     .L5
        add     rsp, 8
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        ret
.L15:
        ret
On each iteration of the outer loop, buff[0] and buff[1] are read from memory again, although each value could be loaded into a register once.
It makes no difference whether the inner loop is present: the compiler does not optimise this construction either way. I added the inner loop only to demonstrate that the compiler cannot rely on the processor cache.
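To make the transformation I have in mind concrete, here is a rough hand-hoisted sketch of what I would expect the optimised code to correspond to. The name SendBufferHoisted, the g_sent log and the stub body of Send are hypothetical and exist only so the sketch compiles and can be checked; in my real code Send is only declared and is opaque to the compiler. The sketch is equivalent to the original only under the assumption that Send never modifies buff[0] or buff[1].

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the external Send() from the question so that the sketch
// links and can be checked; it only records what it was given.
std::vector<uint32_t> g_sent;  // hypothetical log, not part of the original
void Send(uint32_t value) { g_sent.push_back(value); }

// Hypothetical hand-hoisted variant of SendBuffer. Assumes that n > 0
// implies buff has at least two elements (the original assumes the same)
// and that Send() never modifies buff[0] or buff[1].
void SendBufferHoisted(const uint32_t* buff, std::size_t n)
{
    if (n == 0)
        return;  // the original never touches buff when n == 0

    const uint32_t first = buff[0];   // loaded once, before the loop
    const uint32_t second = buff[1];  // loaded once, before the loop

    for (std::size_t i = 0; i < n; ++i)
    {
        Send(first);
        Send(second);
        for (std::size_t j = 0; j < i; ++j) {
            Send(buff[j]);  // inner loop left as in the original
        }
    }
}
```

With the values held in local variables, the compiler is free to keep them in callee-saved registers across the calls, because no other code can observe or change a local whose address is never taken.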
Is it valid, according to the C++ standard, for the compiler to load the value from memory into a register once before the loop (with or without the __restrict__ keyword)?
If it is valid, why doesn't the compiler perform that optimisation?
If it is not valid as written, how can I tell the compiler that nobody will modify that memory, so that the optimisation becomes valid?