
Basically, I need to make this piece of code more efficient, either by reducing the overall code size to save memory or by making it execute faster. I am using Thumb-2 on a Cortex-M3.

I have tried reducing the number of MOV instructions. That does reduce the overall code size, but the way the code works means each item has to be moved into and out of the register the subroutine uses, so I am stumped as to how I can improve it. The code below is in its original state.

THUMB
  AREA RESET, CODE, READONLY
  EXPORT  __Vectors
  EXPORT Reset_Handler
__Vectors 
  DCD 0x00180000     ; top of the stack 
  DCD Reset_Handler  ; reset vector - where the program starts

  AREA 2a_Code, CODE, READONLY
Reset_Handler
  ENTRY

num_words EQU (end_source-source)/4  ; number of words to copy

start   
  LDR r0,=source     ; point to the start of the area of memory to copy from
  LDR r1,=dest       ; point to the start of the area of memory to copy to
  MOV r2,#num_words  ; get the number of words to copy

  ; find out how many blocks of 8 words need to be copied - it is assumed
  ; that it is faster to load 8 data items at a time, rather than load
  ; individually
block
  MOVS r3,r2,LSR #3  ; find the number of blocks of 8 words
  BEQ individ        ; if no blocks to copy, just copy individual words

  ; copy and process blocks of 8 words 
block_loop
  LDMIA r0!,{r5-r12}  ; get 8 words to copy as a block
  MOV r4,r5           ; get first item
  BL data_processing  ; process first item 
  MOV r5,r4           ; keep first item
  MOV r4,r6           ; get second item
  BL data_processing  ; process second item 
  MOV r6,r4           ; keep second item
  MOV r4,r7           ; get third item
  BL data_processing  ; process third item
  MOV r7,r4           ; keep third item  
  MOV r4,r8           ; get fourth item
  BL data_processing  ; process fourth item 
  MOV r8,r4           ; keep fourth item
  MOV r4,r9           ; get fifth item
  BL data_processing  ; process fifth item
  MOV r9,r4           ; keep fifth item  
  MOV r4,r10          ; get sixth item
  BL data_processing  ; process sixth item 
  MOV r10,r4          ; keep sixth item
  MOV r4,r11          ; get seventh item
  BL data_processing  ; process seventh item
  MOV r11,r4          ; keep seventh item 
  MOV r4,r12          ; get eighth item
  BL data_processing  ; process eighth item
  MOV r12,r4          ; keep eighth item  
  STMIA r1!,{r5-r12}  ; copy the 8 words 
  SUBS r3,r3,#1       ; move on to the next block
  BNE block_loop      ; continue until last block reached

  ; there may now be some data items available (fewer than 8)
  ; find out how many of these individual words need to be copied 
individ
  ANDS r3,r2,#7   ; find the number of words that remain to copy individually
  BEQ exit        ; skip individual copying if none remains

  ; copy the excess of words
individ_loop
  LDR r4,[r0],#4      ; get next word to copy
  BL data_processing  ; process the item read
  STR r4,[r1],#4      ; copy the word 
  SUBS r3,r3,#1       ; move on to the next word
  BNE individ_loop    ; continue until the last word reached

  ; languish in an endless loop once all is done
exit    
  B exit

  ; subroutine to scale a value by 0.5 and then saturate values to a maximum of 5 
data_processing
  CMP r4,#10           ; check whether saturation is needed
  BLT divide_by_two    ; if not, just divide by 2
  MOV r4,#5            ; saturate to 5
  BX lr
divide_by_two
  MOV r4,r4,LSR #1     ; perform scaling
  BX lr

  AREA 2a_ROData, DATA, READONLY
source  ; some data to copy
  DCD 1,2,3,4,5,6,7,8,9,10,11,0,4,6,12,15,13,8,5,4,3,2,1,6,23,11,9,10 
end_source

  AREA 2a_RWData, DATA, READWRITE
dest  ; copy to this area of memory
  SPACE end_source-source
end_dest
  END

Basically, I need the code to keep storing the results in the registers, while also being either reduced in size or made to execute faster. Thanks for your help.

artless noise
Hysteria103
  • There is a lot to optimise about this code. I don't have time to write an answer right now, but I might do so later. – fuz Apr 13 '19 at 17:50
  • @fuz Thanks, I’d love to hear back. It doesn’t really have to be major changes or anything, if you don’t want to, just as long as it improves the performance. – Hysteria103 Apr 13 '19 at 21:03
  • Cortex-M3 doesn't have a code cache; all those taken branches are going to hurt performance, and they can't be helping you reduce I-cache misses, because there's no such thing. You probably want to inline your `data_processing` and predicate it with an `IT`. M3 has some "saturation" arithmetic instructions, but that's probably just stuff like add with signed saturation instead of wraparound, not a signed `min` you could use to clamp to a max of +5. At 1 instruction per clock, M3 doesn't care much about hiding latency (other than load/store?), so those MOV instructions are killing you. – Peter Cordes Apr 14 '19 at 07:46
  • If this loop is performance-critical for your application, there's always the option of using a chip with SIMD, e.g. ARM docs mention [Cortex-R52 + NEON](https://developer.arm.com/architectures/instruction-sets/dsp-extensions) as the "medium" option short of Cortex-A + NEON. Or if you can pack your data down to bytes or 16-bit instead of words (since the max value is only +5? So 32 bits is overkill for this step; maybe just saturate as you write it?) Cortex-M4 with DSP extensions will let you treat a 32-bit integer reg as a 4x8 SIMD vector with some instructions. – Peter Cordes Apr 14 '19 at 07:51
  • @PeterCordes Thanks for the suggestion, but unfortunately I have to improve the performance of the code on the M3; I can't change that. So how do I go about doing multiple instructions per clock? Thank you for your answer. – Hysteria103 Apr 14 '19 at 11:17
  • You literally can't. Cortex-M3 is a scalar pipeline. What you *can* do is get the work done in fewer instructions by inlining `data_processing` like other commenters suggested, so you don't have to move data into `r4` and back around the call. And probably make it branchless, because a taken branch costs as much for a pipeline reload as running one extra instruction. – Peter Cordes Apr 14 '19 at 15:48
  • [Apparently](https://www.reddit.com/r/asm/comments/bfhkdh/optimization_help_i_have_a_program_which_copies/eldrbpj/), this is schoolwork. OP, do not cheat! It's never going to end well! – fuz Apr 20 '19 at 22:28

1 Answer


Here is a slightly optimised version of your main loop. Given that you are programming for a Cortex-M3, there is no possibility of superscalar or SIMD processing, as the CPU supports neither. The main differences from your code are:

  • all relevant functions are inlined
  • the logic has been optimised a little bit
  • useless move instructions have been omitted

This code runs in 10 cycles per table entry, plus a few cycles of overhead for the initial check and for taken-branch pipeline reloads.

        .syntax unified
        .thumb

        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction
block:  cbz r2, .Lbxlr          @ 2 return if nothing to copy

.Loop:  ldmia r0!, {r3}         @ 2 load one item from source
        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5
        stmia r1!, {r3}         @ 2 store to destination
        subs r2, r2, #1         @ 1 decrement #words
        bne .Loop               @ 1 continue if not done yet

.Lbxlr: bx lr
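
For reference, the per-element transform this loop implements corresponds to the following C model (the function name is illustrative, not from the original program):

```c
#include <stddef.h>
#include <stdint.h>

/* Scale each word by 0.5, saturating inputs >= 10 down to 5 --
   the same transform as the cmp/ite/lsrlt/movge sequence above. */
static void process_words(const int32_t *src, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] < 10 ? src[i] >> 1 : 5;
}
```

A compiler targeting Thumb-2 would typically turn the conditional expression into an `IT` block much like the hand-written version.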

You can get this down to 16 cycles for two entries (8 cycles per entry) by unrolling the loop once. Note that this almost triples the code length for only a small performance gain.

        .syntax unified
        .thumb

        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction

        @ first check if the number of elements is even or odd
        @ leave this out if it's known to be even
block:  tst r2, #1              @ 1 odd number of entries to copy?
        beq .Leven              @ 2 if even, skip the odd element

        ldmia r0!, {r3}         @ 2 load one item from source
        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5
        stmia r1!, {r3}         @ 2 store to destination
        subs r2, r2, #1         @ 1 decrement #words

        @ check if any elements are left
        @ remove if you know that at least two elements are present
.Leven: cbz r2, .Lbxlr          @ 2 return if no entries left.

.Loop:  ldmia r0!, {r3, r4}     @ 3 load two items from source

        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5

        cmp r4, #10             @ 1 need to scale?
        ite lt                  @ 1 if r4 < 10:
        lsrlt r4, r4, #1        @ 1 then r4 >>= 1
        movge r4, #5            @ 1 else r4 = 5

        stmia r1!, {r3, r4}     @ 3 store to destination
        subs r2, r2, #2         @ 1 decrement #words twice
        bne .Loop               @ 1 continue if not done yet

.Lbxlr: bx lr
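
The structure of this unrolled version, a one-element prologue when the count is odd followed by a main loop that handles two elements per iteration, can be modelled in C like this (a sketch; the function names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Scale by 0.5, saturate to a maximum of 5. */
static int32_t process(int32_t v)
{
    return v < 10 ? v >> 1 : 5;
}

/* Mirrors the unrolled asm: peel off one element if the count is
   odd (the tst r2, #1 check), then process the rest in pairs. */
static void process_words_unrolled(const int32_t *src, int32_t *dst,
                                   size_t n)
{
    if (n & 1) {                 /* odd count: handle one element */
        *dst++ = process(*src++);
        n--;
    }
    while (n) {                  /* n is now even: two per pass */
        dst[0] = process(src[0]);
        dst[1] = process(src[1]);
        src += 2;
        dst += 2;
        n -= 2;
    }
}
```

The pairing is what lets the asm amortise the `ldmia`/`stmia` setup cost and the loop-control instructions over two elements instead of one.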

7 cycles per element could be achieved by unrolling the loop four times, but I believe that is excessive.

Note that this code is in GNU as syntax. It should be trivial to modify it for your assembler.

fuz
  • IIRC, Cortex-M3 can more efficiently load a pair of words from a 64-bit aligned address, so you'd want your startup/cleanup to catch that case as well; maybe with a `cmp` against an end pointer. Or maybe optimize for aligned stores if src and dst can be misaligned relative to each other... I'd have to look up the Cortex-M3 perf info again. Also, don't forget `cbz` as a peephole for `cmp reg, #0` / `beq`. – Peter Cordes Apr 14 '19 at 23:09
  • @PeterCordes Oh! I totally forgot about `cbz`. The [manual](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337e/CACCIFED.html) I have in front of me doesn't say anything about 64 bit alignment. It does indicate however that if a load is followed by a load or store that doesn't require its result in the AGU, then that load has one cycle shaved off. However, `ldmia` and `stmia` are even better here as they allow us to omit having to increment the registers. – fuz Apr 14 '19 at 23:13
  • I don't see that either, that must have been for Cortex-A9 which I was also looking at before the OP of another question specified a core. http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/BABBCJII.html#id4044590 does mention loads taking 2 cycles for the first access, and 1 more per additional, so unless something speeds that up it might be a win to unroll by 4 and amortize that startup over more loads/stores. But yeah it looks like ldmia is good for post-increment. (Also http://infocenter.arm.com/help/topic/com.arm.doc.ddi0337e/CACGAHHB.html mentions the store buffer). – Peter Cordes Apr 14 '19 at 23:20
  • You can use addressing modes that update the address register with `ldr` or `ldrd`, but I think those only work as 32-bit Thumb2 instructions, and `ldmia` isn't slow. – Peter Cordes Apr 14 '19 at 23:21
  • @PeterCordes `ldmia` is also just `1 + N` cycles for `N` registers (same for `stmia`), so this code already takes advantage of this. 6 cycles per element seems to be the asymptotic minimum achievable with this implementation. To get 7 cycles instead of 8, unroll the loop four times instead of two. – fuz Apr 14 '19 at 23:22
  • Another thought: if saturation to `5` is rare, `cmp`/branch for that to an out-of-line block that jumps back would make the fast path cheaper. (`cmp`/`bhi`/`lsr`). Unless `it` can decode as part of the later instructions, that takes it down from 4 to 3 instructions for the fast path. (And not-taken bcc instructions are single-cycle). But makes the slow path much worse, with 2 taken branches = +4 cycles on Cortex-M3 where immediate branch displacements only cost +1 cycle for a pipeline reload. – Peter Cordes Apr 14 '19 at 23:45
  • @PeterCordes This would only save a single cycle in the fast path. I don't think it's really worth the effort. – fuz Apr 14 '19 at 23:47
  • 1 cycle out of 8 or 7 per word is better than a 10% speedup if the slow case never happens, so it depends on the data! The general idea of making the common case fast at the expense of the rare case (if it's rare enough) is definitely worth mentioning. But for realtime systems, you often care more about the worst possible case than average or best-case throughput, so branchless is clearly a win for that, and for code-size. – Peter Cordes Apr 14 '19 at 23:53