
I'm using the latest available version of ARM-packaged GCC:

arm-none-eabi-gcc (GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release) Copyright (C) 2020 Free Software Foundation, Inc.

When I compile this code using "-mcpu=cortex-m0 -mthumb -Ofast":

#include <stdint.h>
/* plus the vendor device header that defines ADC1 */

int main(void) {
    uint16_t num = (uint16_t) ADC1->DR;
    ADC1->DR = num / 7;
}

I would expect that the division would be accomplished by a multiplication and a shift, but instead this code is being generated:

 08000b5c <main>:
 8000b5c: b510 push {r4, lr}
 8000b5e: 4c05 ldr r4, [pc, #20] ; (8000b74 <main+0x18>)
 8000b60: 2107 movs r1, #7
 8000b62: 6c20 ldr r0, [r4, #64] ; 0x40
 8000b64: b280 uxth r0, r0
 8000b66: f7ff facf bl 8000108 <__udivsi3>
 8000b6a: b280 uxth r0, r0
 8000b6c: 6420 str r0, [r4, #64] ; 0x40
 8000b6e: 2000 movs r0, #0
 8000b70: bd10 pop {r4, pc}
 8000b72: 46c0 nop ; (mov r8, r8)
 8000b74: 40012400 .word 0x40012400

Using __udivsi3 instead of multiply and shift is terribly inefficient. Am I using the wrong flags, or missing something else, or is this a GCC bug?

Dan Sandberg

3 Answers


The Cortex-M0 lacks instructions to perform a 32x32->64-bit multiply. Because num is an unsigned 16-bit quantity, multiplying it by 9363 and shifting right 16 would yield a correct result in all cases, but, likely because a uint16_t will be promoted to int before the multiply, gcc does not include such optimizations.
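
A minimal sketch of the kind of replacement being described (not what gcc emits; the exact constant and rounding need care, and the expression below uses the form that is verified exhaustively later in this thread):

#include <stdint.h>

/* One multiply, one add and one shift instead of a __udivsi3 call.
   Exact for every 16-bit dividend: 37449 is about 2^18/7, and
   0xFFFF * 37449 + 16384 still fits in 32 bits. */
static inline uint16_t div7_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 37449u + 16384u) >> 18);
}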

From what I've observed, gcc does a generally poor job of optimizing for the Cortex-M0, failing to employ some straightforward optimizations which would be appropriate for that platform, but sometimes employing "optimizations" which aren't. Given something like

void test1(uint8_t *p)
{
    for (int i=0; i<32; i++)
        p[i] = (p[i]*9363) >> 16; // Divide by 7
}

gcc happens to generate okay code for the Cortex-M0 at -O2, but if the multiplication were replaced with an addition, the compiler would generate code which reloads the constant 9363 on every iteration of the loop. Even if the code were changed to:

void test2(uint16_t *p)
{
    register unsigned u9363 = 9363;
    for (int i=0; i<32; i++)
        p[i] = (p[i]+u9363) >> 16;
}

gcc would still bring the load of the constant into the loop. Sometimes gcc's optimizations may also have unexpected behavioral consequences. For example, one might expect that on a platform like a Cortex-M0, invoking something like:

unsigned short test(register unsigned short *p)
{
    register unsigned short temp = *p;
    return temp - (temp >> 15);
}    

while an interrupt changes the contents of *p might yield behavior consistent with the old value or the new value. The Standard wouldn't require such treatment, but most implementations intended to be suitable for embedded programming tasks will offer stronger guarantees than what the Standard requires. If either the old or new value would be equally acceptable, letting the compiler use whichever is more convenient may allow more efficient code than using volatile. As it happens, however, the "optimized" code from gcc will replace the two uses of temp with separate loads of *p.
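
If exactly one read of *p is what is actually required, one option (a sketch, not part of the answer above) is to make that single access explicit with a volatile-qualified read, accepting whatever efficiency cost that carries:

unsigned short test_volatile(unsigned short *p)
{
    /* volatile access: gcc performs exactly one load here and will not
       duplicate it, even when an interrupt may change *p */
    unsigned short temp = *(volatile unsigned short *)p;
    return temp - (temp >> 15);
}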

If you're using gcc with the Cortex-M0 and are at all concerned about performance or the possibility of "astonishing" behaviors, get in the habit of inspecting the compiler's output. For some kinds of loop, it might even be worth trying -O0: if code makes suitable use of the register keyword, its performance can sometimes beat that of identical code processed with -O2.
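
For inspecting the output, two convenient options with the toolchain from the question (file names here are placeholders) are asking gcc for assembly directly, or disassembling the compiled object:

arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -O2 -S test.c -o test.s
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -O2 -c test.c -o test.o
arm-none-eabi-objdump -d test.o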

supercat
  • Great answer, thank you. Is there a compiler you would suggest that seems to get the details right for a Cortex-M0? – Dan Sandberg Mar 22 '21 at 21:28
  • @DanSandberg: I use the ARM-Keil MDK at work. It's rather expensive, but you could download a free limited-functionality evaluation version. – supercat Mar 22 '21 at 21:30
  • If this issue affects you, please mark the bug as affecting you here: https://bugs.launchpad.net/gcc-arm-embedded/+bug/1920818 – Dan Sandberg Mar 23 '21 at 10:37
  • @DanSandberg: This divide issue is the tip of the iceberg as far as poor Cortex-M0 code generation efficiency is concerned. Have you looked at the code gcc generates for the last `test` function above? Unless or until the maintainers of gcc become interested in trying to maximize the range of non-portable-but-useful constructs they can reliably and efficiently process with the same *semantics* as `-O0`, without creating needless weird corner cases like those in `test`, I'd regard gcc as reliable only when one uses `-O0` for parts of the code one doesn't hand-inspect after each build. – supercat Mar 23 '21 at 16:10
  • @DanSandberg: Incidentally, when computing `x*9363`, gcc uses a load of the constant followed by a multiply, but when using `x*37449` it opts to generate an eight-instruction sequence of moves, adds, and shifts. Maybe a good trade-off on an ARM with a slow multiply, but not on one with a fast multiply. – supercat Mar 23 '21 at 20:13

Expanding on supercat's answer.

Feed this:

unsigned short fun ( unsigned short x )
{
    return(x/7);
}

to a compiler building for something with a larger multiply (here 32-bit ARM rather than Thumb, so a 64-bit-result umull is available):

00000000 <fun>:
   0:   e59f1010    ldr r1, [pc, #16]   ; 18 <fun+0x18>
   4:   e0832190    umull   r2, r3, r0, r1
   8:   e0400003    sub r0, r0, r3
   c:   e08300a0    add r0, r3, r0, lsr #1
  10:   e1a00120    lsr r0, r0, #2
  14:   e12fff1e    bx  lr
  18:   24924925    .word   0x24924925
  

1/7 in binary (long division):

     0.001001001001001
 111)1.000000
       111 
      ==== 
         1000
          111
          ===
            1
            
        
0.001001001001001001001001001001
0.0010 0100 1001 0010 0100 1001 001001
0x2492492492...
0x24924925>>32  (rounded up)

For this to work you need a 64-bit result; you take the top half and make some adjustments. For example:

7 * 0x24924925 = 0x100000003

and you take the top 32 bits (it is not quite this simple, but for this value you can see it working).
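
Rendered as C (a sketch, assuming a target where the full 64-bit product is available, which is what the umull above provides):

#include <stdint.h>

/* Same steps as the ARM listing above: take the high 32 bits of the
   64-bit product, then the sub/add/shift fixup. */
static uint32_t div7_u32(uint32_t x)
{
    uint32_t hi = (uint32_t)(((uint64_t)x * 0x24924925u) >> 32); /* umull, keep the top half */
    uint32_t t  = x - hi;                                        /* sub r0, r0, r3           */
    t = hi + (t >> 1);                                           /* add r0, r3, r0, lsr #1   */
    return t >> 2;                                               /* lsr r0, r0, #2           */
}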

The all-Thumb variant's multiply is 32 bits = 32 bits * 32 bits, so the result would be 0x00000003, and that does not work.

So we take the top 16 bits, 0x2492, which we can either keep or round up to 0x2493 as supercat did.

Now we can use the 32x32 = 32 bit multiply:

0x2492 * 7 = 0x0FFFE
0x2493 * 7 = 0x10005

Let's run with the one larger:

0x100000000 / 0x2493 is a number greater than 65536, so a 16-bit operand times 0x2493 cannot overflow 32 bits. So that is fine.

but:

0x3335 * 0x2493 = 0x0750DB6F
0x3336 * 0x2493 = 0x07510002
0x3335 / 7 = 0x750
0x3336 / 7 = 0x750

So you can only get so far with that approach.

If we follow the model of the arm code:

#include <stdio.h>

int main(void)
{
    unsigned int ra, rb, rc, rd;
    for(ra=0;ra<0x10000;ra++)
    {
        rb=0x2493*ra;
        rd=rb>>16;
        rb=ra-rd;
        rb=rd+(rb>>1);
        rb>>=2;
        rc=ra/7;           /* reference result from a real divide */
        printf("0x%X 0x%X 0x%X \n",ra,rb,rc);
        if(rb!=rc) break;
    }
    return 0;
}

Then it works from 0x0000 to 0xFFFF, so you could write the asm to do that (note it needs to be 0x2493 not 0x2492).

If you know the operand is not going above a certain value then you can use more bits of 1/7th to multiply against.
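
For example (a sketch, not from the original answer: it assumes the dividend is known to be at most 8 bits), with a small operand you can afford 24 fraction bits of 1/7 and skip the fixup entirely:

/* 0x24924A is the first 24 fraction bits of 1/7, rounded up.
   Exact for dividends 0..255, and 255 * 0x24924A fits easily in 32 bits. */
static unsigned int div7_small(unsigned int x)
{
    return (x * 0x24924Au) >> 24;
}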

In any case, when the compiler does not do this optimization for you, you can still do it yourself.

Now that I think about it, I ran into this before, and now it makes sense. I was on a full-sized ARM and called a routine I had compiled in ARM mode (the rest of the code was in Thumb mode), with a switch statement: basically, if the denominator is 1 then result = x/1; if the denominator is 2 then result = x/2; and so on. That avoided the gcclib function and generated the 1/x multiplies (I had 3 or 4 different constants to divide by):

unsigned short udiv7 ( unsigned short x )
{
    unsigned int r0;
    unsigned int r3;
    
    r0=x;
    r3=0x2493*r0;
    r3>>=16;
    r0=r0-r3;
    r0=r3+(r0>>1);
    r0>>=2;
    return(r0);
}

Assuming I made no mistakes:

00000000 <udiv7>:
   0:   4b04        ldr r3, [pc, #16]   ; (14 <udiv7+0x14>)
   2:   4343        muls    r3, r0
   4:   0c1b        lsrs    r3, r3, #16
   6:   1ac0        subs    r0, r0, r3
   8:   0840        lsrs    r0, r0, #1
   a:   18c0        adds    r0, r0, r3
   c:   0883        lsrs    r3, r0, #2
   e:   b298        uxth    r0, r3
  10:   4770        bx  lr
  12:   46c0        nop         ; (mov r8, r8)
  14:   00002493    .word   0x00002493

That should be faster than a generic division library routine.

Edit

I think I see what supercat has done with the solution that works:

((i*37449 + 16384u) >> 18)

We have this as the 1/7th fraction:

0.001001001001001001001001001001

but we can only do a 32 = 32x32 bit multiply. The leading zeros give us some breathing room we might be able to take advantage of. So instead of 0x2492/0x2493 we can try:

1001001001001001
0x9249
0x9249*0xFFFF = 0x92486db7

And so far it won't overflow:

    rb=((ra*0x9249) >> 18);

by itself it fails at ra = 7: 7 * 0x9249 = 0x3FFFF, and 0x3FFFF >> 18 is zero, not 1.

So maybe

    rb=((ra*0x924A) >> 18);

that fails at:

    0xAAAD 0x1862 0x1861 

So what about:

    rb=((ra*0x9249 + 0x8000) >> 18);

and that works.

What about supercat's?

    rb=((ra*0x9249 + 0x4000) >> 18);

and that runs clean for all values 0x0000 to 0xFFFF. Going one step smaller:

    rb=((ra*0x9249 + 0x2000) >> 18);

and that fails here:

0xE007 0x2000 0x2001 

So there are a couple of solutions that work.

unsigned short udiv7 ( unsigned short x )
{
    unsigned int ret;
    ret=x;
    ret=((ret*0x9249 + 0x4000) >> 18);
    return(ret);
}
00000000 <udiv7>:
   0:   4b03        ldr r3, [pc, #12]   ; (10 <udiv7+0x10>)
   2:   4358        muls    r0, r3
   4:   2380        movs    r3, #128    ; 0x80
   6:   01db        lsls    r3, r3, #7
   8:   469c        mov ip, r3
   a:   4460        add r0, ip
   c:   0c80        lsrs    r0, r0, #18
   e:   4770        bx  lr
  10:   00009249    .word   0x00009249

Edit

As far as the "why" question goes, that is not a Stack Overflow question; if you want to know why gcc doesn't do this, ask the authors of that code. All we can do here is speculate: they may have chosen not to because of the number of instructions, or they may have an algorithm that says, if this is not a 64 = 32x32 bit multiply, do not bother.

Again the why question is not a Stack Overflow question, so perhaps we should just close this question and delete all of the answers.

I found this to be incredibly educational (once you know/understand what was being said).

Another "why" question is why gcc did it the way it did, when it could have done it the way supercat or I did.

old_timer
  • Reciprocal Multiplication, a tutorial https://homepage.divms.uiowa.edu/~jones/bcd/divide.html – old_timer Mar 23 '21 at 01:55
  • gcc clearly knows how to do this (try other targets too), so I don't see this as a bug or a missed optimization. I assume they simply have a go/no-go for this after a certain amount of code, or if it's an all-Thumb variant they don't try the optimization. – old_timer Mar 23 '21 at 02:01
  • This post has absolutely nothing to do with the question. The question was "why doesn't gcc do this automatically?" not "how do I do it myself?" which the OP obviously already knows or they wouldn't have asked the question. – Tom V Mar 23 '21 at 06:17

The compiler can only rearrange integer expressions if it knows that the result will be correct for any input allowed by the language.

Because 7 is co-prime to 2, it is impossible to divide an arbitrary input by seven exactly using only a multiply and a shift.

If you know that it is possible for the input that you intend to provide, then you have to do it yourself using the multiply and shift operators.

Depending on the size of the input, you will have to choose how much to shift so that the output is correct (or at least good enough for your application) and so that the intermediate doesn't overflow. The compiler has no way of knowing what is accurate enough for your application, or what your maximum input will be. If it allows any input up to the maximum of the type, then every multiplication will overflow.

In general GCC will only carry out division using shifting when the divisor is a power of two.
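
For contrast (a sketch of the power-of-two case referred to above), an unsigned divide by 8 should reduce to a single shift even on the Cortex-M0, with no library call:

unsigned int div8(unsigned int x)
{
    return x / 8;   /* expect a single lsrs r0, r0, #3 */
}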

Tom V
  • For inputs in the range 0-65535 (`num` is a `uint16_t`), multiplying and shifting would yield precise results easily if the compiler were interested in such optimizations. – supercat Mar 22 '21 at 21:29
  • Yes, for uint16_t divided by 7, but not for any arbitrary constant divisor, and with this being a 32 bit platform the divisor is not usually going to be 16 bits either. The point I was trying to make is that it is a corner case which is not generally applicable. This is why no one has considered it worth their time to add it to the compiler, when in those corner cases people can just do it by hand. – Tom V Mar 22 '21 at 21:39
  • Adding support for efficient 16/16 division on a 32-bit platform would be easier, and offer more value for many embedded applications, than many of the more "clever" things gcc does. – supercat Mar 22 '21 at 22:02
  • You would normally multiply by 1/7th since 7 is not a power of 2. – old_timer Mar 23 '21 at 00:38
  • See my answer: you do a multiply by the constant 1/7 in this case, then there is some shifting or other adjustment that happens based on the constant (the first shift is implied: with 32-bit registers, the shift right by 32 is implied by taking the register containing the upper half of the multiply result). But in essence it is a multiply with a shift to get the result. Both clang and gcc will generate these optimizations for targets with a multiply but no divide, and you will find that in some cases they will do this even if the target has a divide, as a multiply is faster (or can be, depending on the implementation). – old_timer Mar 23 '21 at 01:53
  • You can replace the multiply as well if you don't have one, although I suspect the compiler would give up at that point, as it did here with Thumb, and just call a library function. – old_timer Mar 23 '21 at 01:59
  • @old_timer: For dividends that are produced from uint8_t, uint16_t, or constant values in a way that's guaranteed yield a value in the range 0-65535, a simple multiply instruction will suffice. I would expect that for larger inputs the 32x32 multiply routine would probably be more efficient than the divide routine on many platforms, though some Cortex-M0 devices have "divide helper" circuitry which might nullify that advantage. – supercat Mar 23 '21 at 15:29
  • The multiply using a 32 = 32x32 does not suffice: (x*9363)>>16 fails for many cases between 0 and 0xFFFF, as shown, if that is what you are referring to. You should have seen the question, perhaps years ago now, about why gcc was producing a multiply when the specified ISA had a divide (for an ARM). Chip vendors can certainly add a divider as well if they wish. This was a good/interesting question though. – old_timer Mar 23 '21 at 15:45
  • I would have to look, but for the M0 (and perhaps others) the chip vendor can build for the fast multiply or the slow one; one could argue there could be, or perhaps are, command line options to generate the non-library code where possible. Not a project I am interested in taking on myself though; I would just do what was done above (or synthesize the divide in some other way to avoid a divide). One example would be to just keep the ADC value with the times-seven in there and do the rest of the math, perhaps dividing it out later or not needing to. – old_timer Mar 23 '21 at 15:48
  • For an 8-bit operand the multiply and shift certainly works just fine. – old_timer Mar 23 '21 at 15:49
  • @old_timer: Brain fart on my end. A correct expression would be `((i*37449 + 16384u) >> 18)` [tested for all values] – supercat Mar 23 '21 at 16:57
  • thumbs up, nice – old_timer Mar 23 '21 at 20:40
  • even better if you explain how you came up with that...ahh nevermind I get it...I think... not necessarily the 16384 though...does that pull in another digit? – old_timer Mar 23 '21 at 20:48