
https://godbolt.org/z/dK9v7En5v

For the following C++ code

#include <stdint.h>
#include <cstdlib>

void Send(uint32_t);

void SendBuffer(uint32_t* __restrict__ buff, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        Send(buff[0]);
        Send(buff[1]);  
        for (size_t j = 0; j < i; ++j) {
            Send(buff[j]);   
        }
    }
}

we get the following assembly listing

SendBuffer(unsigned int*, unsigned long):
        test    rsi, rsi
        je      .L15
        push    r13
        mov     r13, rsi
        push    r12
        mov     r12, rdi
        push    rbp
        xor     ebp, ebp
        push    rbx
        sub     rsp, 8
.L5:
        mov     edi, DWORD PTR [r12]
        call    Send(unsigned int)
        mov     edi, DWORD PTR [r12+4]
        call    Send(unsigned int)
        test    rbp, rbp
        je      .L3
        xor     ebx, ebx
.L4:
        mov     edi, DWORD PTR [r12+rbx*4]
        add     rbx, 1
        call    Send(unsigned int)
        cmp     rbx, rbp
        jne     .L4
.L3:
        add     rbp, 1
        cmp     r13, rbp
        jne     .L5
        add     rsp, 8
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        ret
.L15:
        ret

On each loop iteration the values are read from memory, although they could be loaded into registers once.

It doesn't matter whether the internal loop is present or not; the compiler does not optimise this construction either way. I added the loop to demonstrate that the compiler cannot rely on the processor cache.

Is it valid, according to the C++ standard, for the compiler to load from memory into a register once before the loop (with or without the `__restrict__` keyword)? Why doesn't the compiler perform that optimisation if it is valid? And if it is not valid, how can I tell the compiler that nobody will change that memory, so that it becomes valid?

pvl (edited by phuclv)
  • Why `-O2` and not `-O3`? – tadman Feb 21 '23 at 17:50
  • What optimization are you looking for here? – tadman Feb 21 '23 at 17:52
  • Which value are you talking about? – Barmar Feb 21 '23 at 17:53
  • 1
    https://godbolt.org/z/MsP5sdGvG I am talking about buff[0] and buff[1], it could be loaded once on register. O3 doesn't help (and why it should?) – pvl Feb 21 '23 at 17:53
  • On the ARM processor, there are instructions to get items from an array with one fetch or one instruction. Can't get more optimized than this. – Thomas Matthews Feb 21 '23 at 17:54
  • But you can fetch buff[0] and buff[1] once before the cycle or fetch it one each iteration. Second option is better, right? But compiler chooses first – pvl Feb 21 '23 at 17:55
  • 1
    If the body is known, gcc performs the optimization: https://godbolt.org/z/Kn3zxfrhj. Clang seems to do this but only with `buff[0]` in ebp (it reloads `buff[1]` every loop): https://godbolt.org/z/e3ocWz1n1 – Artyer Feb 21 '23 at 17:57
  • `buff[0]` and `buff[1]` are invariants (don't change), so move them outside of the first `for` loop. Change the `i` loop to start at 2 and get rid of the `j` loop. – Thomas Matthews Feb 21 '23 at 17:58
  • 3
    If it saved it in a register, the register might get reused by the `Send()` function. – Barmar Feb 21 '23 at 17:58
  • 1
    Yeah, I know I can move that variable out from loop by myself and store it in register. The question is, why compiler does not optimise it automatically? – pvl Feb 21 '23 at 17:59
  • 1
    @Artyer woah, interesting, but why it's important to know `Send` function body for such optimisation? – pvl Feb 21 '23 at 18:01
  • 1
    If the internals of `Send` are unknown, it's really hard to set up things in an optimal way to make the call efficiently. Who knows what's going on in there! The compiler certainly doesn't at this point. – tadman Feb 21 '23 at 18:03
  • 1
    @tadman: If it stores `buff[0]` in a register, and the `Send` function is complex, then the `Send` function will likely write that register to the stack (memory) when it starts, and then read that register back from the stack (memory) before it exits. If so, then it's actually faster to NOT use the register, because then each loop has only a read, instead of a read and a write. – Mooing Duck Feb 21 '23 at 18:09
  • You should also try using a pointer to `buff` rather than accessing it directly. On compilers, it will dedicate a register to the pointer and RAM access using pointers and offsets is fast. Although the compiler may perform this at higher optimization levels. – Thomas Matthews Feb 21 '23 at 18:25
  • IMHO, You should profile, especially the `Send` function. My guess is the bottleneck is in the `Send` function and other attempts in your code are micro-optimizations and won't generate as much benefit as optimizing the `Send` function, *if you are allowed to modify the `Send` function*. – Thomas Matthews Feb 21 '23 at 18:27
  • 3
    I think it's because `restrict` ensures that direct memory writes within `SendBuffer` will not modify the objects that `buff` points to. However, this guarantee does not extend to effects that the `Send` function can have. Hence, it would be illegal for the compiler to optimize the code. – Lindydancer Feb 21 '23 at 18:43
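Lindydancer's point can be sketched concretely. Even without `__restrict__`, a global pointer may alias the buffer, so the compiler has to assume that a call to an opaque `Send` can rewrite `buff[0]` between iterations. (The global `g_buff` below is an illustrative assumption, not code from the question.)

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative global alias: if g_buff points at the same array that
// SendBuffer receives, Send can legally modify it mid-loop.
static uint32_t* g_buff = nullptr;

void Send(uint32_t v)
{
    if (g_buff)
        g_buff[0] = v;  // clobbers buff[0] behind SendBuffer's back
}

void SendBuffer(uint32_t* buff, size_t n)  // note: no __restrict__ here
{
    for (size_t i = 0; i < n; ++i)
    {
        Send(buff[0]);  // must be reloaded: the previous call changed it
        Send(buff[1]);
    }
}
```

Hoisting `buff[0]` into a register here would change the program's observable behaviour, which is why the compiler reloads it unless it can inline `Send` and prove no such alias exists.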

1 Answer


You could help the compiler by rearranging your code, so that you can see the impact of RAM optimizations.

void SendBuffer(uint32_t* __restrict__ buff, size_t n)
{
    // Access RAM sequentially to take advantage of the data cache.
    const uint32_t a = buff[0];
    const uint32_t b = buff[1];

    for (size_t i = 0; i < n; ++i)
    {
        Send(a);
        Send(b);

        // Start at the third buffer slot.
        for (size_t j = 2; j < n; ++j)
        {
            Send(buff[j]);   
        }
    }
}

In the above code, the bottleneck is the call to Send. Accessing the buff array is much faster. Also, the branch evaluations in the loops take more time than accessing the array.

The true optimization here would be to modify `Send` so that it transfers blocks rather than single words. Most device communications have a block-transfer capability.
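For illustration, a block-oriented interface might look like the sketch below. `SendBlock` is a hypothetical name, and the `std::vector` transport is a stand-in for the real device driver:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the real transport (a driver would use DMA or a
// hardware FIFO here instead of a vector).
static std::vector<uint32_t> g_sent;

// Hypothetical block-transfer primitive: one call moves `count` words.
void SendBlock(const uint32_t* data, size_t count)
{
    g_sent.insert(g_sent.end(), data, data + count);
}

void SendBuffer(const uint32_t* buff, size_t n)
{
    SendBlock(buff, n);  // one call per buffer instead of one per word
}
```

With such an interface the per-word call overhead and the branch-decision cost disappear entirely.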

Otherwise you can try unrolling the loop. (The compiler may perform loop unrolling at higher optimization levels.)

size_t j;
for (j = 2u; (j + 4u) < n; j += 4)
{
     // Optimization:  load consecutively from data cache to reduce
     // the quantity of cache reloads.  
     const uint32_t a = buff[j + 0u];
     const uint32_t b = buff[j + 1u];
     const uint32_t c = buff[j + 2u];
     const uint32_t d = buff[j + 3u];

     // Send a "block" of data:
     Send(a);
     Send(b);
     Send(c);
     Send(d);
}
// Send the remaining words:
for (; j < n; ++j)
{
    Send(buff[j]);   
}

Examining the assembly language should show better organized and better optimized code.

Edit 1: Included outer loop, corrected index variable usage.

Thomas Matthews (edited by 463035818_is_not_an_ai)
  • The question is why I should do it by myself and compiler can not perform such optimisation automatically – pvl Feb 21 '23 at 18:06
  • Corrected the `Send[j]` to `Send[i]`. – Thomas Matthews Feb 21 '23 at 18:13
  • You say "Accessing the buff array is much faster.". Why is this? Is this because `Send` might be writing/restoring the registers from the stack but the direct read sidesteps this, and is therefore faster? Or were you thinking of something else? – Mooing Duck Feb 21 '23 at 18:16
  • 1
    Accessing the `buff` array is faster than calling a function. Accessing an array item is a couple of instructions and worst case reloading the data cache. Calling a function requires a function setup (including pushing arguments and return address) as well as going through the branch decision process. The branches have a worst case of flushing the instruction cache and going through the branch decision process. Without the function call, each iteration has at least 2 branches. The decision process takes time which could be better spent processing instructions. – Thomas Matthews Feb 21 '23 at 18:21
  • 1
    `for (size_t j = 2; i < n; ++i)` <- this probably isn't what you meant – user253751 Feb 21 '23 at 18:51
  • Unless `Send` does a huge amount of stuff, `Send(buff[j]);` in a loop should be hitting in cache on every access. Unrolling to do multiple loads and then multiple Sends won't improve cache hit-rate; the accesses are already close enough (temporally) that the line should still be hot in cache. Unless `Send` is very very expensive, like a `write` system call, in which case yeah it's insane to be calling it one word at a time. Hoisting the loads of `buff[0]` and `buff[1]` is a good idea, though; those might not still be hot for larger `i`. – Peter Cordes Feb 21 '23 at 21:00
  • 2
    @pvl: As Lindydancer [commented under the question](https://stackoverflow.com/questions/75524125/why-compiler-does-not-optimise-ram-lookups#comment133250358_75524125), `__restrict` doesn't constrain `Send` from possibly modifying the array via some global pointer, so the compiler can't hoist those 2 loads for you, unless it can inline `Send` and see that it doesn't do that. – Peter Cordes Feb 21 '23 at 21:03
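Following the last comment, the hoist can always be done by hand: copy the two values into locals whose addresses never escape, so no call to `Send` can touch them. A sketch (the vector-logging `Send` is a test stub, not the real function; note this changes behaviour if `Send` really does modify the buffer, which is exactly the assumption being made explicit):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Test stub standing in for the real Send: records each value sent.
static std::vector<uint32_t> g_log;
void Send(uint32_t v) { g_log.push_back(v); }

void SendBuffer(const uint32_t* __restrict__ buff, size_t n)
{
    if (n == 0)
        return;
    // Local copies: the compiler can keep these in registers, since a
    // local whose address never escapes cannot be modified by Send.
    const uint32_t first  = buff[0];
    const uint32_t second = buff[1];
    for (size_t i = 0; i < n; ++i)
    {
        Send(first);
        Send(second);
        for (size_t j = 0; j < i; ++j)
            Send(buff[j]);
    }
}
```

The call sequence is identical to the original code whenever `Send` leaves the buffer alone.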