1

I have two functions, both are similar to this:

void Bit_Delay()
{
    //this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
    volatile char z = 12;
    
    while(z)
    {
        z++;
        z++;
        z++;
        z++;
        z -= 5;
    }
}

(The second function is analogous instead it uses 18 instead of 12 for the counter).

The code works flawlessly as it is (with z appearing locally to each function internally), but I'm trying to cram a little more functionality into my executable before I hit the (horribly) limited FLASH memory available.

My thought was to promote the z variable to be a global one (a volatile static). Because these two functions are effectively atomic operations (it's a single-threaded CPU and there are no interrupts at play to interfere), I figured that these two functions could share the single variable, thus saving a tiny bit of stack manipulation.

This didn't work. It is clear that the compiler is optimising-out much of the code related to z completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.

I realised that I needed the z variable to be marked volatile to prevent the compiler from removing code it knows is counting a fixed (and thus reducible to a constant) number each time.

Question:

Can I optimise this any further, and trick the compiler into keeping both functions intact? I'm compiling with "-Os" (optimise for small binary).

Here's the entire program verbatim for those playing along at home...

#include <avr/io.h>

#define RX_PIN (1 << PORTB0) //physical pin 3
#define TX_PIN (1 << PORTB1) //physical pin 1

void Bit_Delay()
{
    //this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
    volatile char z = 12;
    
    while(z)
    {
        z++;
        z++;
        z++;
        z++;
        z -= 5;
    }
}

void Serial_TX_Char(char c)
{
    char i;
    
    //start bit
    PORTB &= ~TX_PIN;
    Bit_Delay();
    
    for(i = 0 ; i < 8 ; i++)
    {
        //output the data bits, LSB first
        if(c & 0x01)
            PORTB |= TX_PIN;
        else
            PORTB &= ~TX_PIN;
        
        c >>= 1;
        Bit_Delay();
    }

    //stop bit  
    PORTB |= TX_PIN;
    Bit_Delay();
}

char Serial_RX_Char()
{
    char retval = 0;
    volatile char z = 18; //1.5 bits delay

    //wait for idle high
    while((PINB & RX_PIN) == 0)
    {}
    
    //wait for start bit falling-edge
    while((PINB & RX_PIN) != 0)
    {}

    //1.5 bits delay
    while(z)
    {
        z++;
        z++;
        z++;
        z++;
        z -= 5;
    }

    for(z = 0 ; z < 8 ; z++)
    {
        retval >>= 1; //make space for the new bit
        retval |= (PINB & RX_PIN) << (8 - RX_PIN); //get the bit and store it
        Bit_Delay();
    }
    
    return retval;      
}

int main(void)
{
    CCP = 0xd8; //protection signature for clock registers (see datasheet)
    CLKPSR = 0x00; //set the clock prescaler to "div by 1"
    DDRB |= TX_PIN;
    PORTB |= TX_PIN; //idle high
        
    while (1) 
        Serial_TX_Char(Serial_RX_Char() ^ 0x20);
}

The target CPU is an Atmel ATTiny5 microcontroller, the code above uses up 94.1% of the FLASH memory! If you connect to the chip using a serial port at 9600 Baud, 8N1, you can type characters in and it returns them with bit 0x20 flipped (uppercase to lowercase and vice-versa).

This is not a serious project of course, I'm just experimenting to see how much functionality I could cram into this chip. I'm not going to bother with rewriting this in assembly, I seriously doubt I could do a better job than GCC's optimiser!

EDIT

@Frank asked about the IDE / compiler I'm using...

Microchip Studio (7.0.2542)

The "All Options" string that is passed to the compiler avr-gcc...

-x c -funsigned-char -funsigned-bitfields -DDEBUG  -I"C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\include"  -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -g2 -Wall -mmcu=attiny5 -B "C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\gcc\dev\attiny5" -c -std=gnu99 -MD -MP -MF "$(@:%.o=%.d)" -MT"$(@:%.o=%.d)" -MT"$(@:%.o=%.o)" 
Wossname
  • 123
  • 5
  • 1
    For reference: https://gcc.godbolt.org/z/9qsd5avzb –  Jun 01 '21 at 18:23
  • Use some combination of `#pragma` to disable optimizations for certain parts of your code. – Blindy Jun 01 '21 at 18:26
  • @Frank, Oh I see. Hmm, perhaps I'll try it in ASM then. :) – Wossname Jun 01 '21 at 18:27
  • 2
    You could change the signature of your function to take a paramter for the initialization of the loop variable. `void Bit_Delay(char count) { volatile char z = count; /* ... */ }` and call it `Bit_Delay(12);` or `Bit_Delay(18);` – Bodo Jun 01 '21 at 18:30
  • @Wossname I know you are joking, but if you are trying to shave bytes, being able to change your code and immediately see how that affects the assembly is going to be the way you want to go about it. (you can change the source in the left pane in the link I posted) –  Jun 01 '21 at 18:30
  • It can also help validate assumptions. For example, in https://gcc.godbolt.org/z/sKdz3h8oP, it looks like the code related to z is not, in fact, optimized out (but the resulting assembly sure looks suspiciously small at a glance.) –  Jun 01 '21 at 18:32
  • @Frank, I kind of wasn't really joking, I was surprised at how much assembly code was generated, especially where `z++` is concerned. That website is pretty badass though, thanks for the link to that, appreciated. – Wossname Jun 01 '21 at 18:35
  • 1
    I'm confused by something. You are running out of RAM or flash? You are trying to combine two one byte RAM variables to help a program which is running out of FLASH (program?) memory? Also, the variables as written are on the stack, which is preallocated RAM, and a static variable is not on the stack -- if you get it to work, it might take more RAM. – Basya Jun 01 '21 at 18:37
  • 1
    @Basya, RAM isn't an issue (I have a generous 32 bytes to roam around in), I'm just trying to experiment with how the way I write the code affects the final binary size (which is subject to the FLASH constraint). I figured that the global is allocated once and then simply referenced multiple times, which might be smaller than having two copies of it. I'm far from an expert at low-level compiler antics, I'm learning :) – Wossname Jun 01 '21 at 18:43
  • @Frank, regarding `++z`... makes no difference on my real compiler, it's still 94.1%. This is fascinating stuff. I'm going to have to dust off my AVR assembly manual. – Wossname Jun 01 '21 at 18:46
  • What is your real compiler then? It'd be easier to help you if we could experiment in the same context. –  Jun 01 '21 at 18:47
  • 1
    @Frank, I'll add an edit to the bottom of the Q... – Wossname Jun 01 '21 at 18:53

2 Answers2

1

I question the following assumption:

This didn't work. It is clear that the compiler is optimising-out much of the code related to z completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.

Looking at https://gcc.godbolt.org/z/sKdz3h8oP, it seems like the loops are actually being performed, however, for whatever reason each z++, when using a global volatile z goes from:

subi r28,lo8(-(1))
sbci r29,hi8(-(1))
ld r20,Y
subi r28,lo8((1))
sbci r29,hi8((1))
subi r20,lo8(-(1))
subi r28,lo8(-(1))
sbci r29,hi8(-(1))
st Y,r20
subi r28,lo8((1))
sbci r29,hi8((1))

to:

lds r20,z
subi r20,lo8(-(1))
sts z,r20

You will need to recalibrate your 12, 18, and 5 constants to get your baud rate correct (since fewer instructions are executed in each loop), but the logic is there in the compiled version.

To be clear: This looks really weird to me, the local volatile version is clearly not being compiled correctly. I did find an old gcc bug along these lines: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33970, but it seems to not cover the local variable case.

  • 1
    Wow! That second code sample is what I would have expected the `z++;` to look like (although the subtracting a negative 1 is a bit peculiar to my mind). The compiler's original output (the long string of strange subtractions) seems completely bizarre to me. Thank you for demonstrating that my instinict was right (global volatile z) but that my interpretation of the actual results was wrong (I had assumed that it had entirely failed). Amazing how elevating one simple char to a global can so drastically improve the code size. I appreciate your help. Many thanks. – Wossname Jun 01 '21 at 18:49
  • @Wossname It would appear to be a compiler bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33970 –  Jun 01 '21 at 18:54
  • Nasty. In that case there would likely be a bug in my code anyway, if I did elevate z to global volatile. I'm re-using z as a counter and then calling the function in the same loop. Well that's why it works without promotion, and compiles small with promotion, but most likely would fail in real operation if I was to re-calibrate my delay loops as you suggested (the loop would probably run forever I suppose). Not that any of that matters, really. I've learned how broken avr-gcc is. I'm going to hand code this all over again in AVR assembly and cut out the "optimiser" altogether. – Wossname Jun 01 '21 at 19:02
  • @Wossname No one is forcing you to use z as the loop counter: https://gcc.godbolt.org/z/erqr1WvM7 –  Jun 01 '21 at 19:04
  • I realise. But just look at the madness of the way it uses `lds r20,z` before EACH consecutive `z` increment and `sts z,r20` after it. There are no gaps between, so it's storing and loading it immediately, when it's entirely needless (unless there's some obscure AVR foible I'm not aware of that requires this). Not to worry. I'm fully happy with your explanation, and I'm not going to persue this C code any further. – Wossname Jun 01 '21 at 19:09
  • that's what the `volatile` does. the ++ is forced to increment relative to the current value. You are only doing this to waste time anyways. –  Jun 01 '21 at 19:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/233198/discussion-between-wossname-and-frank). – Wossname Jun 01 '21 at 19:14
0

Can I optimise this any further,

Of course; code like this is extremely expensive — and fagile:

volatile char z = 12;
    
while(z)
{
    z++;
    z++;
    z++;
    z++;
    z -= 5;
}

It's expensive because your are asking for code bloat just to waste some cycles. And it's fragile because minimal changes to the code might change the timings. Apart from that it triggers stack frames because local volatile vars will live in the stack frame.

To make things worse, you are using volatile char z as a loop variable!

Why to use _delay_ms and friends

AVR-LibC provides delay routines like _delay_us and delay_ms in <util/delay.h>. Advantage is that:

  • A specified amount of time is wasted, which is passed as a parameter. (The routines might waste more real time than expected if interrupts are on.)

  • Code size is minimal due to inline assembly, compiler built-ins like __builtin_avr_delay_cycles and code folding.

  • No more need for magic numbers like 12 or 18 in your code.

The delay time must evaluate to a compile-time constant, and optimization must be turned on1. Canonical use case is to compute delay time using F_CPU, baud rate etc. Suppose we want to delay x * 1/BAUD seconds, which is x * 1000000 / BAUD µs.

So let's change Bit_Delay to the following code, where we add -D F_CPU=8000000 to the command line options so that the delay routines have it available:

#define BAUD 9600

__attribute__((always_inline))
static inline void Bit_Delay (double x)
{
    _delay_us (x * 1000000 / BAUD);
}

Then use it like Bit_Delay (1). Changing the 4 usages to that, the code size drops from 480 Bytes to 360 Bytes.

Also adjusting the wait for 1.5 bits to Bit_Delay (1.5) and fixing the loop variable to not be volatile, the code size drops to 180 Bytes.

Functions Serial_RX_Char and Serial_TX_Char are just called once statically, so the compiler can inline them provided we make them static. This reduces the code size further to 170 Bytes, 44 bytes of which are start-up code and vector table. Moreover, we do no more need stack frames (which were triggered by local volatile vars), and the function calls are inlined, which saves RAM. Not unimportant on a device like ATtiny5 with just 32 Bytes of RAM.

FYI, the (inlined) delay code is compiled as:

    ldi r20,lo8(208)
    ldi r21,hi8(208)
1:  subi r20,1
    sbci r21,0
    brne 1b
    nop

Magic 208 is basically F_CPU / BAUD / 4 folded by the compiler, where the division by 4 is because one turn of the delay loop takes 4 ticks.

Why not to use _delay_ms and friends

Busy-waiting is a waste of time and energy. For very short timing issues it might be in order, longer busy-wait may destroy timings of other parts of the code, because it blocks their execution. If possible, use timers + interrupts for that purpose; it saved energy (when sleep can be used when idling) and does not delay execution of unrelated code.


1So that code folding works as expected. Otherwise, the delay code will complain:

util/delay.h:112:3: warning: Compiler optimizations disabled; functions from <util/delay.h> won't work as designed

emacs drives me nuts
  • 2,785
  • 13
  • 23