How can I port IAR AVR assembly smac_24 to GCC?

Question

I'm trying to use the IIR filter from application note AVR223, which is written in the IAR dialect of AVR assembly. The main reason being that in C you can only do full 32*32=32 bits multiplications. So I'm especially interested in the following macros:

MUL_MOV_24 MACRO
 // Multiply (signed) high bytes of both coefficient and sample work registers.
 // Then copy the resulting low byte into the accumulator's high byte
 // (sign bit will be correct).
    muls  COEFH, DATAH
    mov   AC2,   R0

 // Multiply (unsigned) low bytes of the coefficient and sample work registers.
 // Copy the resulting high & low byte to the accumulator's low & middle bytes.
    mul  COEFL, DATAL
    mov  AC0,   R0
    mov  AC1,   R1

 // Multiply (signed-unsigned) high coefficient byte and low sample byte.
 // Add resulting low byte to accumulator's middle byte. (May generate carry!)
 // Add, with carry, resulting high byte to accumulator's high byte.
    mulsu  COEFH, DATAL
    add    AC1,   R0
    adc    AC2,   R1

 // Multiply (signed-unsigned) high sample byte and low coefficient byte.
 // Add resulting low byte to accumulator's middle byte. (May generate carry!)
 // Add, with carry, resulting high byte to accumulator's high byte.
    mulsu  DATAH, COEFL
    add    AC1,   R0
    adc    AC2,   R1
    ENDM


// SMAC_24 does the same thing as MUL_MOV_24 except it adds to the accumulator
// from the start, instead of overwriting.
SMAC_24 MACRO
    muls  COEFH, DATAH
    add   AC2,   R0

    mul   COEFL, DATAL
    add   AC0,   R0
    adc   AC1,   R1    // This may generate a carry..
    adc   AC2,   ZERO  // ..which must be added to accumulator's high byte!

    mulsu  COEFH, DATAL
    add    AC1,   R0
    adc    AC2,   R1

    mulsu  DATAH, COEFL
    add    AC1,   R0
    adc    AC2,   R1
    ENDM

Which I tried to convert to the following functions:

int32_t mul_mov_24(int16_t coef, int16_t data) {
  int32_t ac = 0;
  asm (
  "muls  %B[COEF], %B[DATA] \n\t"
  "mov   %C[AC],   r0 \n\t"

  "mul  %A[COEF], %A[DATA] \n\t"
  "mov  %A[AC],   r0 \n\t"
  "mov  %B[AC],   r1 \n\t"

  "mulsu  %B[COEF], %A[DATA] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"

  "mulsu  %B[DATA], %A[COEF] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"
  : [AC] "=r" (ac)
  : [COEF] "a" (coef),
    [DATA] "a" (data)
  : "r0", "r1");

  return ac;
}

void smac_24(int32_t *ac, int16_t coef, int16_t data) {
  asm (
  "clr r2 \n\t"
  "muls  %B[COEF], %B[DATA] \n\t"
  "add   %C[AC],   r0 \n\t"

  "mul  %A[COEF], %A[DATA] \n\t"
  "add  %A[AC],   r0 \n\t"
  "add  %B[AC],   r1 \n\t"
  "adc  %C[AC],   r2 \n\t"

  "mulsu  %B[COEF], %A[DATA] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"

  "mulsu  %B[DATA], %A[COEF] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"
  : "=r" (*ac)
  : [COEF] "a" (coef),
    [DATA] "a" (data),
    [AC] "0" (*ac)
  : "r0", "r1", "r2");
}

However, I must be doing something silly, as depending on how I call them and in which order I get completely bullshit results, or even a reset. I have the feeling I'm doing something wrong with input, output and clobbering.

Links to the full code and application note: https://github.com/pepijndevos/accessibletuner/blob/master/tuner/iir.h http://www.microchip.com/wwwappnotes/appnotes.aspx?appnote=en592139

gcc usually knows how to optimize away the unnecessary parts of a multiplication, *if* it knows the value range. e.g. on 32-bit x86 to get a full-multiply from two `uint32_t` values, you can do `a * (uint64_t)b` and it will compile to one `mul` instruction (32x32 => 64 bit), without actually creating zeros for the upper half of both inputs and multiplying them. Try the equivalent for AVR and see if gcc does a good job on its own with pure C. Possibly use `if( foo > 0xffffffUL ) __builtin_unreachable()` to tell gcc about 24-bit values? It might not special-case that, though. — Peter Cordes, Feb 18 '18 at 10:22
Also, usually when your inline-asm starts with `mov`, you should have used better constraints and let the compiler copy the register if it wants. But I think you're using a `mov` to copy before a destructive op because you need the input value multiple times, and that's probably not something the compiler could optimize away at all. — Peter Cordes, Feb 18 '18 at 10:24
It's indeed using __mulhisi3 "Multiply 2 signed 16-bit integers to a 32-bit result" so that's not bad at all. — Pepijn, Feb 18 '18 at 16:03
I also found there is a __int24 type that results in __mulpsi3 calls being generated. This seems to be a 24*24=24 multiplication, which may actually be worse than the 16*16=32? — Pepijn, Feb 18 '18 at 16:40
IDK, have a look at `libgcc.a` or microbenchmark it. You might need to use inline asm if you need every last cycle of performance, and 98% of optimal (or whatever you get with compiler-generated code) isn't good enough. — Peter Cordes, Feb 19 '18 at 03:34

score 1 · Accepted Answer · answered Oct 18 '22 at 11:37

There are at least four obvious bugs in that inline asm code:

Bug: Zero-Reg (R1) must contain 0

avr-gcc assumes that R1 (__zero_reg__) always contains a value of zero. You may temporarily write a different value into that register, for example by means of MUL, but after that sequence you must restore the value of zero. See also section Register Layout in the avr-gcc ABI.

Also notice that just clobbering R1 does not do the trick because R1 is a fixed register, i.e. the register allocator won't generate code for it. You can add clobbers to indicate side effects to someone who is reading the code, but you still have to clear R1 by hand at the end of the inline assembly.

Bug: Early-clobber in Inline Assembly

The 1st inline assembly snippet has the following output and input operands:

  : [AC] "=r" (ac)
  : [COEF] "a" (coef),
    [DATA] "a" (data)

Now it is completely fine for the compiler to allocate ac to registers that may overlap coef and / or data because the inputs die at the asm.

If you read the inputs after you started writing to the output(s), then you are going to overwrite and clobber the inputs. Your constraints are only correct when you consumed all the inputs before writing to the output(s), which is not the case for your code.

Correct code is to early-clobber the output:

  : [AC] "=&r" (ac)

Bug: ac = 0 is void

You are setting ac = 0 prior to the 1st asm, where you claim that the inline asm is writing all 32 bits of ac by means of output constraint =r (ac). This means that the compiler is allowed to optimize away the initialization of ac. Even if ac was initialized, then the high-byte would always be zero, which is a strange result as the output is signed.

What you can do is to use __int24, or you can set %D[AC] in the inline asm as needed.

Similar in the 2nd inline asm: You are never setting the high-byte of *ac, so the high-byte will contain the old value.

Bug: add add adc is wrong

The sequence

  "clr r2"          "\n\t"
  ...
  "mul  %A[COEF], %A[DATA] \n\t"
  "add  %A[AC],   r0 \n\t"
  "add  %B[AC],   r1 \n\t"
  "adc  %C[AC],   r2 \n\t"

is wrong and sub-optimal. The 2nd add should be an adc.

Moreover, the sequence is using R2 just once, hence no need to use an extra (and also expensive because callee-saved) register. You can use CLR which does not change the Carry flag:

  "mul  %A[COEF], %A[DATA] \n\t"
  "add  %A[AC],   r0 \n\t"
  "adc  %B[AC],   r1 \n\t" // Was ADD
  "clr  r1"         "\n\t"
  "adc  %C[AC],   r1 \n\t" // No need to use / clobber R2.

Suboptimal Code: Use movw instead of 2×mov

The 1st asm has

  "mov  %A[AC],   r0 \n\t"
  "mov  %B[AC],   r1 \n\t"

If an AVR device has mul, then it also has movw. Hence:

  "movw %A[AC],   r0 \n\t"

Notice that ac will always start at an even register number.

Suboptimal Code: Use inline Functions

Using inline function(s) will reduce call overhead and will improve register layout, because there is no need to funnel the values through the registers imposed by the ABI. Moreover, int32_t *ac means you took the address of some (likely) local variable, which means that variable has to live in the frame. Not a good choice. Hence:

static inline int32_t mul_mov_24 (int16_t coef, int16_t data)

and similar for the 2nd function, maybe even with __attribute__((__always_inline__)).

How can I port IAR AVR assembly smac_24 to GCC?

1 Answers1