I originally had the following C code:

volatile register uint16_t counter asm("r12");

__uint24 getCounter() {
  __uint24 res = counter;
  res = (res << 8) | TCNT0;
  return res;
}

This function runs in some hot places and is inlined, and I'm trying to cram a lot of stuff into an ATtiny13, so it came time to optimize it.

That function compiles to:

getCounter:
        movw r24,r12
        ldi r26,0
        clr r22
        mov r23,r24
        mov r24,r25
        in r25,0x32
        or r22,r25
        ret

I came up with this assembly:

inline __uint24 getCounter() {
  //__uint24 res = counter;
  //res = (res << 8) | TCNT0;
  
  uint32_t result;
  asm(
    "in %A[result],0x32" "\n\t"
    "movw %C[result],%[counter]" "\n\t"
    "mov %B[result],%C[result]" "\n\t"
    "mov %C[result],%D[result]" "\n\t"
    : [result] "=r" (result)
    : [counter] "r" (counter)
    :
  );
  return (__uint24) result;
}

The reason for the uint32_t is to "allocate" a fourth consecutive register and to let the compiler know it is clobbered (since I cannot write something like "%D[result]" in the clobber list).

Is my assembly correct? From my testing it seems like it is. Is there a way to get the compiler to optimize getCounter() better so there's no need for confusing assembly? Is there a better way to do this in assembly?

EDIT: The whole idea with the movw is to keep the read atomic, since the counter variable is incremented inside an interrupt.

Kryštof Vosyka

2 Answers

As my experiments on Godbolt suggest, even with the -O3 flag the avr-gcc optimizer is just not sophisticated enough. I doubt there are any other flags that can trick it into optimizing this specific code better (I tried some, but none helped).

But there is an alternative way to write the same code using a union, and in that case the compiler optimizes the assembly better. Thus, there is no need to resort to inline assembly.

Original code analysis

  1. The counter variable is stored in r12 (LSB) and r13 (MSB) registers.
  2. TCNT0 is read from I/O space address 0x32 (by in Rd, 0x32 instruction).
  3. According to the avr-gcc ABI, the 24-bit value is returned in r22(LSB):r23:r24(MSB).
  4. So to summarize, we want the following transfer to occur:
    r24 <-- r13
    r23 <-- r12
    r22 <-- TCNT0   
    

Even newer solution (no inline assembly!)

Looking into the code, I guess you have some kind of timer interrupt incrementing counter when the timer reaches some upper threshold. If that's the case, the code has a deeper problem, even in the pure C version. The important part is that the reads of TCNT0 and counter should be atomic together, as a single unit! Otherwise, if the interrupt occurs between the movw and in instructions, your result will be inaccurate. Here is an example scenario demonstrating the bug:

counter = 0x0010, TCNT0 = 0xff
MOVW copies 0x0010
Interrupt occurs => handler sets counter = 0x0011 and TCNT0 = 0 
IN instruction reads TCNT0 = 0
result = 0x0010_00 (instead of expected 0x0010_ff)

There are two ways to solve this:

  1. Wrap CLI / SEI around the two reads to get them together without possible interrupt in the middle.
  2. Read TCNT0 twice, before and after reading the counter. If the second read gives a smaller result, an interrupt occurred in between and we can't trust the values; repeat the whole read (see the sketch after this list).
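
For illustration, here is a minimal sketch of option 2 (the retry approach). It is not from the original code: the function name is mine, and it assumes the overflow interrupt is the only writer of counter and that less than one full timer period elapses inside the loop.

__uint24 getCounterRetry() {
  uint8_t t0, t1;
  uint16_t c;
  do {
    t0 = TCNT0;      // first timer sample
    c  = counter;    // 16-bit read; may tear, but a torn read implies
    t1 = TCNT0;      // an overflow between t0 and t1, caught below
  } while (t1 < t0); // TCNT0 wrapped => an interrupt fired; retry
  return ((__uint24)c << 8) | t1;
}

(In real code you may also have to ensure the compiler does not reorder the counter read relative to the volatile TCNT0 reads.)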

Thus, a correct solution without the bug might look like this (add an inline specifier on the function as needed):

__uint24 getCounter() {
  union
  {
    __uint24 result;

    struct {
      uint8_t lo;
      uint16_t hi;
    } parts;
  } u;

  __builtin_avr_cli();   // disable interrupts
  u.parts.hi = counter;  // 16-bit counter, read with interrupts off
  u.parts.lo = TCNT0;
  __builtin_avr_sei();   // re-enable interrupts

  return u.result;
}

Producing:

getCounter:
        cli
        mov r23,r12
        mov r24,r13
        in r22,0x32
        sei
        ret

Godbolt: https://godbolt.org/z/YrWrT8sT4
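
As a side note, AVR-LibC's <util/atomic.h> expresses the same cli / sei pattern more idiomatically. Below is a sketch of the same function rewritten with ATOMIC_BLOCK (the function name is mine); ATOMIC_RESTORESTATE saves and restores SREG instead of blindly re-enabling interrupts, so this variant also stays correct when called with interrupts already disabled:

#include <util/atomic.h>

__uint24 getCounterAtomic() {
  union {
    __uint24 result;

    struct {
      uint8_t lo;
      uint16_t hi;
    } parts;
  } u;

  // The block body runs with interrupts disabled;
  // SREG is restored afterwards.
  ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
    u.parts.hi = counter;
    u.parts.lo = TCNT0;
  }

  return u.result;
}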

Newer solution (less assembly, partial atomicity)

With the atomicity requirement added, we must use the movw instruction. Here is a version that minimizes the amount of inline assembly and uses as much C as possible. Note that atomicity is only partial: the 16-bit counter itself is read atomically, but an interrupt can still fire before the subsequent TCNT0 read, so the scenario above remains possible:

__uint24 getCounter() {
  union
  {
    __uint24 result;

    struct {
      uint8_t lo;
      uint16_t hi;
    } parts;
  } u;

  uint16_t tmp;

  // Ensure the counter is read atomically with movw instruction
  asm("movw %C[tmp],%[counter]  \n\t" : [tmp] "=r" (tmp) : [counter] "r" (counter));

  u.parts.hi = tmp;
  u.parts.lo = TCNT0;

  return u.result;
}

Godbolt: https://godbolt.org/z/P9a9K6n76

Old solution (without atomicity)

Question author's assembly analysis

It looks correct and produces the right results. However, there are two things I can suggest improving:

  1. It has 3 move instructions (one movw plus two mov), taking 3 cycles to execute. gcc generated similar code because movw operates only on even register pairs. But you can replace these with just 2 mov instructions, which also removes the need for the larger uint32_t variable.
  2. I would avoid hardcoding TCNT0 address for better code portability.

Suggested assembly

So here is a slightly modified version of your code:

inline __uint24 getCounter() {
  __uint24 result;
  asm(
    "in   %A[result], %[tcnt0]"    "\n\t"
    "mov  %B[result], %A[counter]" "\n\t"
    "mov  %C[result], %B[counter]" "\n\t"
    : [result] "=r" (result)
    : [counter] "r" (counter)
    , [tcnt0] "I" (_SFR_IO_ADDR(TCNT0))
  );
  return result;
}

However, note a downside of this solution – we lose atomicity on reading the counter. If an interrupt occurs between the two mov instructions and counter is modified inside the interrupt, we might get incorrect results. But if counter is never modified by interrupts, I would prefer the two separate mov instructions for the performance benefit.

Godbolt: https://godbolt.org/z/h3nT4or97 (I removed inline keywords to show the generated assembly)

vvv444
  • Ideally GCC could use a `__uint24` that started with an odd register, allowing just `in` + `movw`. But there may be no way to request that. If `res = (counter << 8) | TCNT0;` doesn't compile to that, GCC's probably not going to do it when generating code around inline asm either. I tried having inline asm produce the result in the high 3 bytes of a `uint32_t`, and hoped that GCC would be able to optimize away a `result >> 8` and just use the high 3 registers, e.g. when storing. But no, it does 3 useless `mov` + 1x `clr` before 3 `std` insns :/ https://godbolt.org/z/1v7dKWbMb – Peter Cordes Apr 20 '23 at 10:11
  • @PeterCordes In any case, the return value is dictated by the ABI to be located in `r22:r24`, so you couldn't choose the alignment anyway. – vvv444 Apr 20 '23 at 10:29
  • 1
    If you look at the Godbolt link, I was working around that (to simulate inlining into a large function where the `result` isn't visible at an ABI boundary) by taking a `__uint24 *out` function arg, and doing `*out = ...;`. I changed the function to return `void`. – Peter Cordes Apr 20 '23 at 10:41
  • @PeterCordes Ah, I see. Missed that. On the other hand, you can't predict which alignment will the calling context need. It might have constraints requiring the return value to be aligned to odd or even boundary. So one way or another additional move might be required. – vvv444 Apr 20 '23 at 10:46
  • Yeah it might, but I wondered whether we could save instructions for the case where it doesn't. Yes in theory, but not with GCC's braindead behaviour that insists on forming the `result >> 8` result in 4 registers instead of just reading the 3 registers that contain the data. – Peter Cordes Apr 20 '23 at 10:52
  • 1
    Indeed, atomicity of the read is one of the requirements for this function in most places, unfortunatelly. However your analysis is still great, thank you. – Kryštof Vosyka Apr 20 '23 at 14:45
  • @KryštofVosyka Again, is the counter changed by interrupts that preempt this function? If it is not - atomicity is still preserved. – vvv444 Apr 20 '23 at 14:54
  • 1
    It is, see my edit. – Kryštof Vosyka Apr 20 '23 at 14:55
  • @KryštofVosyka Please see my updates, I think that might give you some additional important info. – vvv444 Apr 20 '23 at 19:25

You will read the value of counter from R13:R12, so you need two MOVs and one IN to read TCNT0. A working version using inline assembly is:

#include <avr/io.h>

register uint16_t counter asm("r12");

static inline __attribute__((__always_inline__))
__uint24 getCounter (void)
{
    __uint24 result;

    __asm ("mov %B0, %A1" "\n\t"
           "mov %C0, %B1"
           : "=r" (result)
           : "r" (counter), "0" (TCNT0));

    return result;
}

Some notes on that solution:

  • Maximal inlining is achieved with static inline and always_inline.

  • TCNT0 is read in the C/C++ code, not in the assembly, so the compiler can choose the best instruction to read that SFR (IN or LDS, depending on the architecture). It's also more convenient, as there's no need for the _SFR_IO_ADDR gobbledegook from AVR-LibC.

  • GCC will allocate the reg that reads TCNT0 to the same register as result. As the avr-gcc ABI is little-endian, it will be allocated to the LSB of result. This is all fine with GCC inline assembly, even though TCNT0 and result have incompatible types.

  • Global register variables like counter can't be volatile, and GCC will warn:

    warning: optimization may eliminate reads and/or writes to register variables [-Wvolatile-register-var]
    volatile register uint16_t counter asm("r12");
    ^~~~~~~~
    

    The reason is historical: GCC's internal representation of a REG doesn't even have a volatile property. So you might want to rethink your code. For example, looping like while (counter != 0) ... might not do what you are expecting.

  • Using global register variables like counter comes with some caveats: for every module / compilation unit, the compiler must know that it must not allocate variables to a register that is otherwise freely available. Hence, you have to include the declaration of counter in each and every module, including the ones that don't even use counter. Or better still, compile all modules with -ffixed-12 -ffixed-13 (an example compile line is sketched after this list). To reduce interference with the calling convention, better use R2 instead of R12. Notice that R12 might be used to pass parameters, and code from libc / libgcc might also use R12, because there's no way for these libs to know that R12 (or R2, for that matter) is forbidden.
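
For example, a build line along these lines (the -mmcu value and file names are placeholders for your project) reserves R12 and R13 in every compilation unit:

avr-gcc -mmcu=attiny13 -Os -ffixed-12 -ffixed-13 -c module.c -o module.o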

An example that uses the code above and shows the generated assembly: compile the following code with -Os -save-temps.

void f (int, __int24);

int main (void)
{
    f (0, getCounter() /* in R22:R20 */);
}

*.s will read:

main:
    in r20,0x32
/* #APP */
    mov r21, r12
    mov r22, r13
/* #NOAPP */
    ldi r25,0
    ldi r24,0
    rcall f
    ...

Reading counter atomically

As mentioned in a comment, counter should be read atomically. A movw takes one cycle and is thus faster than a cli / sei sequence. It's enough to use a 24-bit variable, though I am not sure whether one register less of register pressure would even make a difference. Anyway, here is a solution with movw. The SFR is read in the assembly, so the asm has to be volatile:

static inline __attribute__((__always_inline__))
__uint24 getCounter (void)
{
    __uint24 result;
    __asm volatile ("movw %A0, %A1" "\n\t" // Atomic read of counter.
                    "mov  %C0, %B0" "\n\t"
                    "mov  %B0, %A0" "\n\t"
                    "in   %A0, %i2"
                    : "=r" (result)
                    : "r" (counter), "n" (&TCNT0));
    return result;
}

Notice that the inline assembly operand print modifier %i was introduced in GCC v4.7, which is the same version that brought __uint24; so no head-scratching about %i.

emacs drives me nuts
  • Interesting point about `volatile register` not being fully supported. It does work differently from a plain local var, but https://godbolt.org/z/nj8c5c4bW shows some cases where optimizations don't fully respect `volatile`. As GCC itself warns about with `-Wall`, "optimization may eliminate reads and/or writes to register variables". I tried the same function with a non-`volatile` register global, and it optimizes away a lot more. – Peter Cordes Apr 20 '23 at 14:49
  • I'm aware of all the problems using register variables, however it's the only way to keep the code fast. Thanks for pointing them out, though. – Kryštof Vosyka Apr 20 '23 at 14:51
  • Unfortunately I need the atomicity of the `movw` instruction, see the edit in the question. – Kryštof Vosyka Apr 20 '23 at 14:52
  • 1
    @Kryštof Vosyka: I added an atomic version using `movw`. It's very similar to your code in the question, but tidied up a bit (no magic numbers etc.). – emacs drives me nuts Apr 20 '23 at 15:41