Code optimization and loop unrolling

Question

I am trying to get familiar with assembly programming. To start, I picked a random piece of code and tried to improve it. I have also read a bit about loop unrolling, but I do not really know where to start.

This is the code, which I have already modified a bit:

0000: 4401000C |            | ADDI R0, 0x000C, R1
0004: 00000000 |            | NOP  
0008: 00000000 |            | NOP  
000C: 0C220000 | loop       | LDW  R2, 0x0000(R1)
0010: 00000000 |            | NOP  
0014: 00000000 |            | NOP  
0018: 1C411000 |            | ADD  R2, R1, R2
001C: 00000000 |            | NOP  
0020: 00000000 |            | NOP  
0024: 4C420004 |            | MULI R2, 0x0004, R2
0028: 00000000 |            | NOP  
002C: 00000000 |            | NOP  
0030: 18220040 |            | STW  R2, 0x0040(R1)
0034: 48210008 |            | SUBI R1, 0x0008, R1
0038: 00000000 |            | NOP  
003C: 00000000 |            | NOP  
0040: 0C230004 |            | LDW  R3, 0x0004(R1)
0044: 00000000 |            | NOP  
0048: 00000000 |            | NOP  
004C: 18230044 |            | STW  R3, 0x0044(R1)
0050: 7C01FFB8 |            | BRGE R1, loop
0054: 00000000 |            | NOP  
0058: 00000000 |            | NOP  
005C: 7000FFFC | halt       | BRZ  R0, halt
0060: 00000000 |            | NOP  
0064: 00000000 |            | NOP

You might want to ask an actual question, it's customary around here – Leeor, May 12 '14 at 17:15
I think I have figured out how to use loop unrolling, but I still do not know how to improve the code – NULLexit, May 12 '14 at 21:31
"That's a lot of NOPs... – twalberg". I was wondering the same. There should be no need for those. AVR is not that heavily pipelined. – turboscrew, May 12 '14 at 22:46
So why all the NOPs? Occasionally they're used for alignment. But you have far too many there. – Mysticial, May 12 '14 at 23:03

score 0 · Answer 1 · answered May 12 '14 at 22:53

Loop unrolling means writing the loop body out explicitly, once per iteration, when the loop is known to be short. It saves the looping overhead, which matters most on heavily pipelined processors where a taken branch is expensive (the pipeline has to be flushed and refilled).

Basically: instead of

for (i=0; i<3; i++)
{
   a[i] = 0;
}

you do simply:

a[0] = 0;
a[1] = 0;
a[2] = 0;

I don't think you gain anything by unrolling your loop.
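For longer loops, or loops whose trip count is not known in advance, the same idea is usually applied only partially: several iterations are handled per pass and a short cleanup loop takes the leftovers. The following is a minimal sketch of that pattern, not a recommendation for the code in the question. The function name `clear_unrolled` and the unroll factor of 4 are arbitrary choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: zero n ints, four per pass (factor 4 is arbitrary). */
void clear_unrolled(int *a, size_t n)
{
    assert(a != NULL || n == 0);

    size_t i = 0;

    /* Main unrolled loop: one branch per four stores. */
    for (; i + 4 <= n; i += 4) {
        a[i]     = 0;
        a[i + 1] = 0;
        a[i + 2] = 0;
        a[i + 3] = 0;
    }

    /* Cleanup loop: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        a[i] = 0;
    }
}
```

Calling `clear_unrolled(buf, 7)` would run the main loop once and the cleanup loop three times. Whether this is faster than the plain loop depends on the processor and on what the compiler already does by itself.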

answered May 12 '14 at 22:53

turboscrew

This is more common on processor architectures that have parallel execution units such as a PowerPC or Itanium. The PowerPC can execute several integer instructions in a single clock cycle if the instructions are laid out correctly. Whether this improves performance is heavily dependent on what you're doing. Measure, measure, measure. – jbruni May 12 '14 at 23:03
Yep, I forgot to mention that. I'd have referred to TI C64X DSP core. – turboscrew May 13 '14 at 06:03
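To illustrate the point about parallel execution units: unrolling tends to help on such machines only when the unrolled statements do not depend on each other. Below is a rough sketch, assuming a summation loop as the example workload. The names `sum_simple` and `sum_two_accumulators` are made up for this illustration, and nothing here is specific to PowerPC, Itanium or the C64x.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version: every addition depends on the previous one. */
int64_t sum_simple(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Unrolled by two with separate accumulators: the two additions in the
 * loop body are independent, so a core with several integer units may
 * be able to issue them in the same cycle. */
int64_t sum_two_accumulators(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n) {          /* odd element left over */
        s0 += v[i];
    }
    return s0 + s1;
}
```

Both functions return the same result. Which one is faster can only be settled by timing them on the target, as the comment above says.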

score 0 · Answer 2 · answered May 13 '14 at 21:44

I read a little more about loop unrolling and I think I get it. What do you think about the following code?

0000: 4401000C |            | ADDI R0, 0x000C, R1
0004: 00000000 |            | NOP  
0008: 00000000 |            | NOP  
000C: 0C220000 | loop       | LDW  R2, 0x0000(R1)
0010: 0C24FFF8 |            | LDW  R4, 0xFFF8(R1)
0014: 0C23FFFC |            | LDW  R3, 0xFFFC(R1)
0018: 0C25FFF4 |            | LDW  R5, 0xFFF4(R1)
001C: 1C822000 |            | ADD  R4, R2, R4
0020: 1C411000 |            | ADD  R2, R1, R2
0024: 48210008 |            | SUBI R1, 0x0008, R1
0028: 48260008 |            | SUBI R1, 0x0008, R6
002C: 4C420004 |            | MULI R2, 0x0004, R2
0030: 4C840004 |            | MULI R4, 0x0004, R4
0034: 18230044 |            | STW  R3, 0x0044(R1)
0038: 18C50044 |            | STW  R5, 0x0044(R6)
003C: 18220048 |            | STW  R2, 0x0048(R1)
0040: 18C40048 |            | STW  R4, 0x0048(R6)
0044: 00000000 |            | NOP  
0048: 00000000 |            | NOP  
004C: 7000FFFC | halt       | BRZ  R0, halt
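One way to check an unrolled version against the original is to model both in C and compare the memory they leave behind. The sketch below does that under several assumptions that the thread does not confirm: operands are read as `OP src1, src2/imm, dst`, `LDW dst, off(base)` loads and `STW src, off(base)` stores a word, `BRGE R1, loop` branches while R1 >= 0, data memory is separate from code, and instructions run strictly in order (NOPs and pipeline hazards are ignored). `WORDS`, `at`, `run_original`, `run_unrolled` and `first_mismatch` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS 32  /* hypothetical data memory size: 32 words */

/* Map a byte address to its word slot (aligned, in range). */
static int32_t *at(int32_t *mem, int32_t addr)
{
    assert(addr >= 0 && addr % 4 == 0 && addr / 4 < WORDS);
    return &mem[addr / 4];
}

/* Sequential model of the loop from the question. */
void run_original(int32_t *mem)
{
    int32_t r1 = 0x0C, r2, r3;   /* ADDI R0, 0x000C, R1 */
    do {
        r2 = *at(mem, r1);        /* LDW  R2, 0x0000(R1) */
        r2 = r2 + r1;             /* ADD  R2, R1, R2     */
        r2 = r2 * 4;              /* MULI R2, 0x0004, R2 */
        *at(mem, r1 + 0x40) = r2; /* STW  R2, 0x0040(R1) */
        r1 = r1 - 8;              /* SUBI R1, 0x0008, R1 */
        r3 = *at(mem, r1 + 4);    /* LDW  R3, 0x0004(R1) */
        *at(mem, r1 + 0x44) = r3; /* STW  R3, 0x0044(R1) */
    } while (r1 >= 0);            /* BRGE R1, loop       */
}

/* Sequential model of the unrolled listing above. */
void run_unrolled(int32_t *mem)
{
    int32_t r1 = 0x0C, r2, r3, r4, r5, r6;
    r2 = *at(mem, r1);            /* LDW  R2, 0x0000(R1) */
    r4 = *at(mem, r1 - 8);        /* LDW  R4, 0xFFF8(R1) */
    r3 = *at(mem, r1 - 4);        /* LDW  R3, 0xFFFC(R1) */
    r5 = *at(mem, r1 - 12);       /* LDW  R5, 0xFFF4(R1) */
    r4 = r4 + r2;                 /* ADD  R4, R2, R4     */
    r2 = r2 + r1;                 /* ADD  R2, R1, R2     */
    r1 = r1 - 8;                  /* SUBI R1, 0x0008, R1 */
    r6 = r1 - 8;                  /* SUBI R1, 0x0008, R6 */
    r2 = r2 * 4;                  /* MULI R2, 0x0004, R2 */
    r4 = r4 * 4;                  /* MULI R4, 0x0004, R4 */
    *at(mem, r1 + 0x44) = r3;     /* STW  R3, 0x0044(R1) */
    *at(mem, r6 + 0x44) = r5;     /* STW  R5, 0x0044(R6) */
    *at(mem, r1 + 0x48) = r2;     /* STW  R2, 0x0048(R1) */
    *at(mem, r6 + 0x48) = r4;     /* STW  R4, 0x0048(R6) */
}

/* Byte address of the first differing word, or -1 if identical. */
int32_t first_mismatch(const int32_t *a, const int32_t *b)
{
    for (int32_t i = 0; i < WORDS; i++) {
        if (a[i] != b[i]) {
            return i * 4;
        }
    }
    return -1;
}
```

Under these assumptions the two versions agree on the words stored at 0x40, 0x48 and 0x4C, but can differ at 0x44: `ADD R4, R2, R4` adds the value loaded into R2, where the second iteration of the original loop adds R1 (which is 4 at that point). The model says nothing about data hazards, which the NOPs in the original listing presumably covered, so those would need checking separately on the real pipeline.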
