Speed of nested if vs for loop

Question

I recently came across this piece of code in an interrupt service routine (ISR):

#define MAX_CHANNELS 4
static uint16_t volatile* ADCVALS[MAX_CHANNELS] = {
    &ADC1BUF0, &ADC1BUF1, &ADC1BUF2, &ADC1BUF3
};
static uint8_t CHANNELS = 0;
static uint16_t volatile* volatile BUFFER_IDX[MAX_CHANNELS];

void __attribute__((interrupt, no_auto_psv)) _AD1Interrupt(void) {
    *(BUFFER_IDX[0]++) = *ADCVALS[0];
    if (CHANNELS >= 1) {
        *(BUFFER_IDX[1]++) = *ADCVALS[1];
        if (CHANNELS >= 2) {
            *(BUFFER_IDX[2]++) = *ADCVALS[2];
             if (CHANNELS >= 3) {
                *(BUFFER_IDX[3]++) = *ADCVALS[3];
            }
        }
    }
}

It copies between 1-4 register values into memory, depending on the value of CHANNELS, which is a value between 0-3 which is set elsewhere in the program via a setter function.

I found the nested if's extremely ugly and changed it to this:

int i;
for (i = 0; i <= CHANNELS; i++) {
    *(BUFFER_IDX[i]++) = *ADCVALS[i];
}

which promptly broke the ISR. This is an embedded system, PIC24 architecture, 64 MHz clock. The ISR is severely time constrained and must finish within 1 µs. The for loop is apparently too slow, while the nested if is fast enough.

My question, then, is two-fold:

Is there a less ugly way to do what the nested if clauses do, without slowing down the ISR?
Why is the for loop so much slower? I would have expected the compiler (xc16) to be smart enough to generate similar asm for both (at -O2).

For arrays, a loop condition like `i <= CHANNELS` is almost always wrong. Usually something like `i < CHANNELS` is correct. — Some programmer dude, Oct 10 '22 at 08:18
Also, have you looked at the generated assembly code? What *is* `CHANNELS`? Is it a macro (common convention is to use all-uppercase words for macros, not for variables) or an enumeration? — Some programmer dude, Oct 10 '22 at 08:20
`CHANNELS` is a value between 0-3, where 0 means copy one value, and 3 means copy four. `BUFFER_IDX` is an array of pointers to a preallocated memory region. The region is split into 1-4 sections, depending on the value of `CHANNELS`. So `i <= CHANNELS` is correct, since the loop should run for `CHANNELS == 3` but not for `CHANNELS == 4`. — abessman, Oct 10 '22 at 08:28
The final if condition should probably be `if (CHANNELS == 3)` instead of `if (CHANNELS >= 3)`, since the only valid values for `CHANNELS` is 0-3. In that case, I believe the two code examples are equivalent, no? — abessman, Oct 10 '22 at 08:31
As already mentioned, you should look and compare the generated assembly. It is hard to predict performance by looking at the C code alone. If you have issues in high performance code, you cannot just blindly trust compiler to do the right thing. — user694733, Oct 10 '22 at 08:34
This question doesn't make much sense and cannot be answered without knowing the type/declarations of `BUFFER_IDX` and `CHANNELS` and so on. Start with disassembling the original code and take it from there. The only thing I can tell you from this code is that you shouldn't be using naive types such as `int` when coding embedded systems. In case of PIC, using `int` may make the code needlessly slow for nothing gained and the compiler might not be able to optimize away such basic mistakes. — Lundin, Oct 10 '22 at 10:38
@abessman As for why `int` is slow, it's pretty obvious. It's an 8-bit MCU so swap it for `uint8_t`. — Lundin, Oct 11 '22 at 06:59

score 5 · Accepted Answer · answered Oct 10 '22 at 08:42

for (int i = 0; i <= CHANNELS; i++) {
    *(BUFFER_IDX[i]++) = *ADCVALS[i];
}

Is

        mov     DWORD PTR [rbp-4], 0
        jmp     .L2
.L3:
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     rax, QWORD PTR [rbp-80+rax*8]
        mov     edx, DWORD PTR [rax]
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     rax, QWORD PTR [rbp-48+rax*8]
        lea     rsi, [rax+4]
        mov     ecx, DWORD PTR [rbp-4]
        movsx   rcx, ecx
        mov     QWORD PTR [rbp-48+rcx*8], rsi
        mov     DWORD PTR [rax], edx
        add     DWORD PTR [rbp-4], 1
.L2:
        mov     eax, DWORD PTR [rbp-4]
        cmp     eax, DWORD PTR [rbp-8]
        jle     .L3

And

*(BUFFER_IDX[0]++) = *ADCVALS[0];
if (CHANNELS >= 1) {
    *(BUFFER_IDX[1]++) = *ADCVALS[1];
    if (CHANNELS >= 2) {
        *(BUFFER_IDX[2]++) = *ADCVALS[2];
        if (CHANNELS >= 3) {
            *(BUFFER_IDX[3]++) = *ADCVALS[3];
        }
    }
}

Is

mov     rax, QWORD PTR [rbp-80]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-48]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-48], rcx
mov     DWORD PTR [rax], edx
cmp     DWORD PTR [rbp-4], 0
jle     .L2
mov     rax, QWORD PTR [rbp-72]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-40]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-40], rcx
mov     DWORD PTR [rax], edx
cmp     DWORD PTR [rbp-4], 1
jle     .L2
mov     rax, QWORD PTR [rbp-64]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-32]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-32], rcx
mov     DWORD PTR [rax], edx
cmp     DWORD PTR [rbp-4], 2
jle     .L2
mov     rax, QWORD PTR [rbp-56]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-24]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-24], rcx
mov     DWORD PTR [rax], edx

How you can see nested if will do less jumps but current compilers can optimize it and with -O3 flag you will get something like this

        mov     eax, DWORD PTR [rsp+12]
        test    eax, eax
        js      .L2
        mov     rax, QWORD PTR [rsp+48]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+16]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
        test    eax, eax
        jle     .L2
        mov     rax, QWORD PTR [rsp+56]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+24]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
        sub     eax, 1
        jle     .L2
        mov     rax, QWORD PTR [rsp+64]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+32]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 2
        jle     .L2
        mov     rax, QWORD PTR [rsp+72]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+40]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
.L2:

That has +- same performance as nested if-s

Thank you for the explanation. I suspected `-O3` might make the for loop faster, but I'm using the free version of xc16 which only goes to `-O2`. I suppose I'm stuck the the nested if's for now. — abessman, Oct 10 '22 at 08:55
@abessman on -O2 it is loop and it really little bit slower, but code more readable — TheRyuTeam, Oct 10 '22 at 09:57
Performance of some x86 and a PIC are vastly different things. It's completely senseless to optimize in favour of branch prediction for example. — Lundin, Oct 10 '22 at 10:33

Fe2O3 · Answer 2 · 2022-10-12T23:15:34.250

2

I'm gonna go out on a limb ('cuz I don't usually get this close to the metal).

switch( CHANNELS ) {
    case 3: *(BUFFER_IDX[3]++) = *ADCVALS[3]; /* Fallthroughs all the way down */
    case 2: *(BUFFER_IDX[2]++) = *ADCVALS[2];
    case 1: *(BUFFER_IDX[1]++) = *ADCVALS[1];
    default:*(BUFFER_IDX[0]++) = *ADCVALS[0];
}

This is 'prettier', and trades-off branching with the fetch/assign operations.
Would be nice if someone familiar with Assembly and performance code would please (kindly and gently) evaluate this alternative.

EDIT
New information about the relationship of ISR duration and # of 'channels' read suggests this might be an alternative that is easier on the eyes.

void __attribute__((interrupt, no_auto_psv)) _AD1Interrupt(void) {
    *(BUFFER_IDX[0]++) = *ADCVALS[0]; if (CHANNELS == 0) goto doneHere;
    *(BUFFER_IDX[1]++) = *ADCVALS[1]; if (CHANNELS == 1) goto doneHere;
    *(BUFFER_IDX[2]++) = *ADCVALS[2]; if (CHANNELS == 2) goto doneHere;
    *(BUFFER_IDX[3]++) = *ADCVALS[3];
doneHere:
    ; // a "sequence point"
}

One has to overcome one's aversion to goto. The compiler is indifferent. Coders should be too.

edited Oct 12 '22 at 23:15

answered Oct 10 '22 at 10:15

Fe2O3

6,077
2
4
20

Be careful on GCC warnings too. – unalignedmemoryaccess Oct 10 '22 at 10:17
@tilz0R Yes, this is "rough"... I only use an old compiler, not as 'finicky' as current versions.... Just wondering about the C -> assembly of this. Probably worse than a modern, clever optimiser would achieve... Noted, though. Thank you... – Fe2O3 Oct 10 '22 at 10:19
There's no obvious reason why this would be faster or slower than a for loop. Older embedded systems compilers are known to optimize switch statements poorly, however. The PIC compiles in particular always had a real nasty reputation in regards to pretty much everything, including optimizations. Only one way to tell and that's to disassemble with the compiler used. – Lundin Oct 10 '22 at 10:46
1

I actually tried this already. It is faster, but not as fast as the nested if's (and not fast enough, unfortunately). – abessman Oct 10 '22 at 11:42
@abessman I think this is not place with bad performance – TheRyuTeam Oct 10 '22 at 17:17
This is not the same as the OP's code when `CHANNELS` is greater than 3. – John Bollinger Oct 10 '22 at 17:40
1

@JohnBollinger Moot point now that th OP has commented here that this is "not fast enough", but... 3rd comment under the original question, OP writes "CHANNELS is a value between 0-3,.." Just going on the clues at the time... All good... :-) – Fe2O3 Oct 10 '22 at 20:45

score 1 · Answer 3 · answered Oct 10 '22 at 17:36

I did as several commenters suggested and looked at the assembly generated by the compiler for the different codes. I also realized that the 1 µs time constraint is only valid for the case where CHANNELS == 0. For values greater than zero, the program bottlenecks elsewhere, which relaxes the time constraint on the ISR somewhat.

Here's the asm for the nested if's:

mov _BUFFER_IDX,w0
mov _ADC1BUF0,w1
mov w1,[w0++]
mov w0,_BUFFER_IDX
mov.b   _CHANNELS,WREG
bra z,.L79
mov _BUFFER_IDX+2,w1
mov _ADC1BUF1,w2
mov w2,[w1++]
mov w1,_BUFFER_IDX+2
sub.b   w0,#1,[w15]
bra leu,.L79
mov _BUFFER_IDX+4,w1
mov _ADC1BUF2,w4
mov w4,[w1++]
mov w1,_BUFFER_IDX+4
sub.b   w0,#2,[w15]
bra z,.L79
mov _BUFFER_IDX+6,w0
mov _ADC1BUF3,w5
mov w5,[w0++]
mov w0,_BUFFER_IDX+6
.L79
retfie

And here it is for the for loop:

mov #_CHANNELS,w6
mov.b   [w6],w6
ze  w6,w6
mov #_ADCVALS-2,w3
clr w0
mov #_BUFFER_IDX,w5
.L79:
add w0,w0,w2
add w5,w2,w2
mov [w2],w1
mov [++w3],w4
mov [w4],[w1++]
mov w1,[w2]
inc w0,w0
sub w0,w6,[w15]
bra leu,.L79
retfie

These are all single-cycle operations, except for retfie (return from interrupt), which takes 6 cycles. For the case CHANNELS == 0, the nested if's use a total of 12 cycles, while the for loop uses 21.

For the switch-case suggested in another answer:

switch( CHANNELS ) {
    case 3: *(BUFFER_IDX[3]++) = *ADCVALS[3]; /* Fallthroughs all the way down */
    case 2: *(BUFFER_IDX[2]++) = *ADCVALS[2];
    case 1: *(BUFFER_IDX[1]++) = *ADCVALS[1];
    default:*(BUFFER_IDX[0]++) = *ADCVALS[0];
}

the asm looks like this:

mov.b   _CHANNELS,WREG
sub.b   w0,#2,[w15]
bra z,.L81
sub.b   w0,#3,[w15]
bra z,.L82
sub.b   w0,#1,[w15]
bra z,.L80
.L79:
mov _BUFFER_IDX,w0
mov _ADC1BUF0,w5
mov w5,[w0++]
mov w0,_BUFFER_IDX
.L82:
mov _BUFFER_IDX+6,w0
mov _ADC1BUF3,w1
mov w1,[w0++]
mov w0,_BUFFER_IDX+6
.L81:
mov _BUFFER_IDX+4,w0
mov _ADC1BUF2,w2
mov w2,[w0++]
mov w0,_BUFFER_IDX+4
.L80:
mov _BUFFER_IDX+2,w0
mov _ADC1BUF1,w4
mov w4,[w0++]
mov w0,_BUFFER_IDX+2
bra .L79
retfie

which uses 18 cycles for CHANNELS == 0.

For curiosity, I also tested a do-while loop:

uint16_t i = 0;
do {
    *(BUFFER_IDX[i]++) = *ADCVALS[i];
} while (i++ < CHANNELS);

which turns out to be the slowest of the bunch at 22 cycles:

mov #_ADCVALS-2,w3
mov #_CHANNELS,w6
mov.b   [w6],w6
ze  w6,w6
inc w6,w6
clr w0
mov #_BUFFER_IDX,w5
.L79:
add w0,w0,w2
add w5,w2,w2
mov [w2],w1
mov [++w3],w4
mov [w4],[w1++]
mov w1,[w2]
inc w0,w0
sub w0,w6,[w15]
bra nz,.L79
retfie

I am a bit surprised that a difference of only 9 cycles is enough to break the time constraint, but I suppose the ISR must have been only just fast enough, so those extra cycles pushed it over the edge.

Speed of nested if vs for loop

3 Answers3