7

When trying to address individual bytes inside a uint64_t, AVR gcc⁽¹⁾ generates a strange prologue/epilogue, while the same function written using uint32_t compiles to a single ret (the example function does nothing but return its argument).

Why does gcc do this? How do I remove this?

You can see the code here, in Compiler Explorer.

⁽¹⁾ gcc 5.4.0 from Arduino 1.8.9 distribution, parameters=-O3 -std=c++11.

Source code:

#include <stdint.h>

uint32_t f_u32(uint32_t x) {
  union y {
    uint8_t p[4];
    uint32_t w;
  };
  return y{ .p = {
    y{ .w = x }.p[0],
    y{ .w = x }.p[1],
    y{ .w = x }.p[2],
    y{ .w = x }.p[3]
  } }.w;
}

uint64_t f_u64(uint64_t x) {
  union y {
    uint8_t p[8];
    uint64_t w;
  };
  return y{ .p = {
    y{ .w = x }.p[0],
    y{ .w = x }.p[1],
    y{ .w = x }.p[2],
    y{ .w = x }.p[3],
    y{ .w = x }.p[4],
    y{ .w = x }.p[5],
    y{ .w = x }.p[6],
    y{ .w = x }.p[7]
  } }.w;
}

Generated assembly for the uint32_t version:

f_u32(unsigned long):
  ret

Generated assembly for the uint64_t version:

f_u64(unsigned long long):
  push r28
  push r29
  in r28,__SP_L__
  in r29,__SP_H__
  subi r28,72
  sbc r29,__zero_reg__
  in __tmp_reg__,__SREG__
  cli
  out __SP_H__,r29
  out __SREG__,__tmp_reg__
  out __SP_L__,r28
  subi r28,-72
  sbci r29,-1
  in __tmp_reg__,__SREG__
  cli
  out __SP_H__,r29
  out __SREG__,__tmp_reg__
  out __SP_L__,r28
  pop r29
  pop r28
  ret
André Kugland
  • Where is your question? – David Grayson Sep 04 '19 at 15:19
  • @DavidGrayson I added it now. – André Kugland Sep 04 '19 at 17:45
  • Looks like some argument passing overhead because the 32-bit int is passed in a register, but there are no 64-bit registers. But I can't say for sure. – Sebastian Redl Sep 06 '19 at 08:11
  • 1
    The functions are completely optimized away because they are not used. Only 64-bit values are returned over the stack, so the second function allocates 8 bytes on the stack. Remove the function if you want to remove this. You can see the full implementation of the function if you remove the optimization option. – Juraj Sep 06 '19 at 10:38
  • 2
    It is difficult to know what you are looking for. Your questions have the trivial answers that (1) gcc does this because its optimizer is not powerful enough to reduce `f_u64()` to a NOP and (2) you can remove this by removing the function or try to implement it as `return x;`. If these are not the answers you are looking for, perhaps you could reword the question or elaborate in a comment? – nielsen Sep 06 '19 at 22:30
  • nielsen, this is not the actual function I want to write, this is a minimal example of the compiler’s behaviour. – André Kugland Sep 07 '19 at 03:04
  • It is more than minimal, so it doesn't work as a minimal example. Did you read my comment? – Juraj Sep 07 '19 at 06:11
  • Got it. It is a good question, but I do not think I can answer it satisfactorily. It seems `f_u64()` for some reason allocates 72 bytes on the stack and then frees them again. I tried adding a function that takes a `uint64_t`, calls `f_u64()` and returns the result plus 10, and compiled it with optimization for size (`-Os`). This function does not get any stack acrobatics, so it is not a general aspect of passing `uint64_t`. Currently my best guess is that this is some problem with the compiler/optimizer, but I cannot put my finger on it. Personally, I would live with it or try to find a workaround. – nielsen Sep 07 '19 at 07:19

3 Answers

9

I am not sure if this is a good answer, but it is the best I can give. The assembly for the f_u64() function allocates 72 bytes on the stack and then deallocates them again (since this involves registers r28 and r29, they are saved in the beginning and restored in the end).

If you compile without optimization (I also dropped the -std=c++11 flag, which I do not think makes any difference), you will see that the f_u64() function starts by allocating 80 bytes on the stack (similar to the opening statements in the optimized code, just with 80 bytes instead of 72):

    in r28,__SP_L__
    in r29,__SP_H__
    subi r28,80
    sbc r29,__zero_reg__
    in __tmp_reg__,__SREG__
    cli
    out __SP_H__,r29
    out __SREG__,__tmp_reg__
    out __SP_L__,r28

These 80 bytes are actually all used. First the value of the argument x is stored (8 bytes) and then a lot of moving data around is done involving the remaining 72 bytes.
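One possible reading of the 72-byte figure, offered as an unconfirmed guess and not as something read off the compiler's output: the source of f_u64() writes nine objects of type `union y` (eight temporaries `y{ .w = x }` plus the outer one whose `.w` is returned), and nine 8-byte objects occupy exactly 72 bytes. The sketch below only checks that arithmetic; the names `Y`, `kTemporaries`, `kResult` and `frame_guess` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Same layout as the union `y` inside f_u64() in the question.
union Y {
  uint8_t p[8];
  uint64_t w;
};

// Assumption (not confirmed): one stack slot per union object in the source.
constexpr std::size_t kTemporaries = 8;  // y{ .w = x } appears eight times
constexpr std::size_t kResult = 1;       // the outer y{ .p = { ... } }

constexpr std::size_t frame_guess() {
  return (kTemporaries + kResult) * sizeof(Y);
}

static_assert(sizeof(Y) == 8, "union y is 8 bytes");
static_assert(frame_guess() == 72, "nine 8-byte objects");
static_assert(frame_guess() + sizeof(uint64_t) == 80,
              "plus 8 bytes for the stored argument x");
```

If this reading is right, it would also be consistent with the 80 bytes seen without optimization: 72 bytes for the unions plus 8 bytes for the copy of x.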

After that the 80 bytes are deallocated on the stack similar to the closing statements in the optimized code:

    subi r28,-80
    sbci r29,-1
    in __tmp_reg__,__SREG__
    cli
    out __SP_H__,r29
    out __SREG__,__tmp_reg__
    out __SP_L__,r28

My guess is that the optimizer concludes that the 8 bytes for storing the argument can be spared. Hence it needs only 72 bytes. Then it concludes that all the moving around of data can be spared. However, it fails to figure out that this means that the 72 bytes on the stack can be spared.

Hence my best bet is that this is a limitation or a bug in the optimizer (whichever you prefer to call it). In that case the only "solution" is to shuffle the real code around to find a workaround, or to report it as a bug against the compiler.
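As one example of what such a workaround might look like, the bytes can be extracted and reassembled with shifts instead of going through a union. This is only a sketch: the helper names `get_byte`, `from_bytes` and `f_u64_shift` are made up for illustration, and I have not checked whether avr-gcc 5.4.0 then emits the function without the stack adjustment.

```cpp
#include <cstdint>

// Hypothetical helper: byte i of x, where i = 0 is the least significant byte.
static inline uint8_t get_byte(uint64_t x, unsigned i) {
  return static_cast<uint8_t>(x >> (8 * i));
}

// Hypothetical helper: rebuild a 64-bit value from 8 bytes,
// least significant byte first.
static inline uint64_t from_bytes(const uint8_t (&b)[8]) {
  uint64_t r = 0;
  for (unsigned i = 0; i < 8; ++i) {
    r |= static_cast<uint64_t>(b[i]) << (8 * i);
  }
  return r;
}

// Same round trip as f_u64() in the question, without a union.
uint64_t f_u64_shift(uint64_t x) {
  const uint8_t b[8] = {
    get_byte(x, 0), get_byte(x, 1), get_byte(x, 2), get_byte(x, 3),
    get_byte(x, 4), get_byte(x, 5), get_byte(x, 6), get_byte(x, 7)
  };
  return from_bytes(b);
}
```

On a little-endian target such as the AVR, `get_byte(x, i)` corresponds to `p[i]` of the union in the question. 64-bit shifts may be costly on an 8-bit core if they are not folded away, so the generated assembly is worth inspecting before adopting this.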

nielsen
  • Maybe the reason why the compiler can’t optimize this away is the presence of the CLI instruction in the prologue/epilogue? – André Kugland Sep 12 '19 at 13:43
  • 1
    That could very well be what holds it back. I am actually puzzled by the order of these statements. It seems reasonable to disable interrupts while updating the `SP` (stack pointer), but then it is strange to restore `SREG` (the status register, which includes the interrupt enable/disable bit) before completing the update of `SP`. – nielsen Sep 12 '19 at 14:24
  • 1
    That's actually an optimization. Similar to a delay slot, the new SREG with interrupts enabled doesn't come into effect until after the next instruction. The ATtiny48 datasheet mentions this as "When using the SEI instruction to enable interrupts, the instruction following SEI will be executed before any pending interrupts, as shown in this example." – Yann Vernier Sep 13 '19 at 20:19
  • @YannVernier And presumably the same is true for updating the SREG with `out`. Thanks for clarifying that. – nielsen Sep 13 '19 at 20:22
  • 1
    Found another description of this delayed interrupt handling behaviour in the [AVR-libc FAQ: Why are interrupts re-enabled in the middle of writing the stack pointer?](https://www.nongnu.org/avr-libc/user-manual/FAQ.html#faq_spman) – Yann Vernier Nov 14 '19 at 18:59
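The ordering discussed in these comments can be illustrated with a toy model. This is a deliberately simplified sketch, not a cycle-accurate AVR simulator: the names `Op` and `run_sp_update` are hypothetical, the interrupt request is assumed to arrive right after `cli`, and the only hardware rule modelled is the one quoted above (the instruction following an interrupt enable is executed before any pending interrupt). The function returns the stack pointer value that a simulated interrupt handler would observe.

```cpp
#include <cstdint>

// The instruction sequence from the prologue/epilogue, plus a trailing NOP.
enum Op { CLI, OUT_SPH, OUT_SREG, OUT_SPL, NOP };

// Returns the SP value seen by the interrupt handler in this toy model.
uint16_t run_sp_update(uint16_t old_sp, uint16_t new_sp,
                       bool one_instruction_delay) {
  const Op program[] = { CLI, OUT_SPH, OUT_SREG, OUT_SPL, NOP };
  bool i_flag = true;         // global interrupt enable bit in SREG
  bool can_interrupt = true;  // may an interrupt be taken before the next op?
  bool pending = false;       // is an interrupt request waiting?
  uint16_t sp = old_sp;

  for (Op op : program) {
    if (pending && can_interrupt) {
      return sp;  // the handler runs here and sees this SP
    }
    const bool i_before = i_flag;
    switch (op) {
      case CLI:
        i_flag = false;
        pending = true;  // assumption: the request arrives right after cli
        break;
      case OUT_SPH:  // write the high byte of the new SP
        sp = static_cast<uint16_t>((sp & 0x00FF) | (new_sp & 0xFF00));
        break;
      case OUT_SREG:  // restore the saved SREG, in which the I bit was set
        i_flag = true;
        break;
      case OUT_SPL:  // write the low byte of the new SP
        sp = static_cast<uint16_t>((sp & 0xFF00) | (new_sp & 0x00FF));
        break;
      case NOP:
        break;
    }
    // With the delay, a freshly set I bit only takes effect one op later.
    can_interrupt = one_instruction_delay ? (i_before && i_flag) : i_flag;
  }
  return sp;  // no interrupt was taken inside the sequence
}
```

With `old_sp = 0x07F0` and `new_sp = 0x0838` (72 bytes higher), `run_sp_update(0x07F0, 0x0838, true)` returns `0x0838`: the handler only runs after both halves of SP have been written. With the delay rule switched off, the same call returns `0x08F0`, a half-updated stack pointer, which is why the instruction order is only safe because of that rule.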
1

You asked how to remove the inefficient code. My answer to your question is that you can just get rid of your function, since it's not performing any calculation and just returning the same value that was passed to it.

If you want to still be able to call that function in other code for some reason, I would do:

#define f_u64(x) ((uint64_t)(x))
David Grayson
1

The overhead you're seeing is a result of the endianness with which the CPU stores numbers. In the example you refer to on Compiler Explorer you've selected the Uno, so GCC generates assembly for the ATmega328P (little-endian). You're also mapping the uint64 onto 8 x uint8, so the compiler needs to turn the high-order and low-order 32-bit portions of the 64-bit number around... and then turn them back on the return. (You will see that godbolt shows these two parts in different colours.)

How do you remove it? That's just the way the ATmega328P works. You will see if you select the Raspbian compiler on godbolt that the overhead goes away, because the endianness of that platform is big-endian.

  • 1
    Could you explain how the generated code which manipulates the stack pointer in a way consistent with allocating 72 bytes on the stack has anything to do with endian conversion? Also why would endian conversion be necessary for the `uint64_t`, but not the `uint32_t`? I am sorry, but I do not understand this answer at all. – nielsen Sep 09 '19 at 09:48