How to get the ARM compiler to use the STM instruction instead of STR?

Question

I am compiling a C function that repeatedly reads a single memory location and writes each value into a memory buffer. I am trying to get the compiler to generate a single STM instruction instead of multiple STRs.

The target CPU is a Cortex-M0+, which has neither an instruction prefetch unit nor a cache, so my assumption is that a single STM instruction is cheaper than multiple STRs in terms of instruction fetch cycles.
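To make that assumption concrete, here is a rough tally of code size and cycle counts for storing n registers either way. It is only a sketch: the cycle figures (2 per STR, 1 + n per STM) are assumed values for zero-wait-state memory and should be checked against the Cortex-M0+ Technical Reference Manual, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed costs; verify against the Cortex-M0+ TRM for the actual part. */
enum {
    THUMB16_BYTES = 2, /* STR and STM both use 16-bit Thumb encodings */
    STR_CYCLES    = 2  /* assumed cost of one STR, zero wait states   */
};

/* Storing n registers with n separate STR instructions. */
static uint32_t str_bytes(uint32_t n)  { return n * THUMB16_BYTES; }
static uint32_t str_cycles(uint32_t n) { return n * STR_CYCLES; }

/* Storing n registers with a single STM: one instruction fetched,
   assumed to take 1 + n cycles. */
static uint32_t stm_bytes(uint32_t n)  { (void)n; return THUMB16_BYTES; }
static uint32_t stm_cycles(uint32_t n) { return 1u + n; }
```

Under these assumptions, four stores cost 8 bytes and 8 cycles as STRs, versus 2 bytes and 5 cycles as one STM. Wait states on the instruction bus would widen the gap, since three fewer instructions are fetched.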

I am aware of the -fldm-stm option, but it only enables the feature; it does not hint to the compiler that STM should be preferred.

The reference code is:

#include <stdint.h>

#define port (0x12345678U)
extern uint32_t buf[16];

void myfunc(void)
{
    uint32_t *p = buf;

    for (uint8_t i=0; i<16; i++)
    {
        *(p++) = *(volatile uint32_t *)(port);
    }
}

Compile options: -O3 -fldm-stm --target=arm-arm-none-eabi -mcpu=cortex-m0+ -mthumb

Update 1: Considering some good tips in the comments, I changed the code and options, adding a loop-unroll pragma and optimizing for size:

#include <stdint.h>

#define port (0x12345678U)
extern uint32_t buf[16];

void myfunc(void)
{
    uint32_t *p = buf;

#pragma unroll (4)

    for (uint8_t i=0; i<16; i++)
    {
        *(p++) = *(volatile uint32_t *)(port);
    }
}

Compile options: -Os -fldm-stm --target=arm-arm-none-eabi -mcpu=cortex-m0+ -mthumb

Still the compiler won't use the STM instruction.

Update 2: After more tweaking, I am now able to get much closer to the construct I am looking for:

#include <stdint.h>

#define port (0x12345678U)
extern uint32_t buf[16];

void myfunc(void)
{
    register uint32_t r0, r1, r2, r3;
    uint32_t *p = buf;

    for (uint8_t i=0; i<16; i+=4)
    {
        r0 = (uint32_t) (*(volatile uint32_t *)(port));
        r1 = (uint32_t) (*(volatile uint32_t *)(port));
        r2 = (uint32_t) (*(volatile uint32_t *)(port));
        r3 = (uint32_t) (*(volatile uint32_t *)(port));
        *(p++) = r0;
        *(p++) = r1;
        *(p++) = r2;
        *(p++) = r3;
    }
}

Compiler Explorer now emits the following loop body:

.LBB0_1:
    ldr     r3, [r2]
    ldr     r4, [r2]
    ldr     r5, [r2]
    ldr     r6, [r2]
    stm     r1!, {r3, r4, r5, r6}  ;; Bingo!
    adds    r1, #0                 ;; Why do we need this line?
    adds    r0, r0, #4
    cmp     r0, #12
    blo     .LBB0_1

It is not clear to me why the `adds r1, #0` line I marked above is required. Any ideas?

Optimizing for size (`-Os`) instead of speed might do it: https://godbolt.org/z/bY8xcYjov — Nate Eldredge, Jul 18 '22 at 17:59
@NateEldredge - well, surprisingly, this actually did generate the `STM` instruction, *but* not in the intended way. It converted the unrolled loop of `LDR`/`STR` into a proper loop, with one `LDR` and one `STM` per iteration. As such it does not really solve the problem (actually making it worse for 16 iterations, as the loop logic adds substantial overhead). — ysap, Jul 18 '22 at 18:08
@NateEldredge - actually looking at the link you sent, changing back to `-O3`, it looks like the generated code is suboptimal, as it issues two loads per iteration. In my local build (*slightly* different code) there is a single load, as there should be. Note that your example does not do that. — ysap, Jul 18 '22 at 18:13
@ysap it is not suboptimal - it is how the `volatile` data works. — 0___________, Jul 18 '22 at 18:17
IMO STM will be slower considering that ART will not be used and STM `n 1S+(n-1)I 1N+(n-1)S Storing n registers, n > 1.` — 0___________, Jul 18 '22 at 18:18
@ysap: Yeah, don't there have to be two loads? `port` itself is `volatile`, and then `*(volatile uint32_t *)port` is `volatile` as well, so both must be reloaded on every access. Hopefully your local version is missing one of those two `volatiles`, else it is being miscompiled. — Nate Eldredge, Jul 18 '22 at 18:18
Apparently you are right. Looks like I have double `volatile` there. I'll fix the code in a moment. Anyway, the problem is with the STM. A somewhat unrolled loop (say, 4 iterations unrolled) allows the compiler to use registers for storing the loaded data, then write them in one instruction. Maybe the corrected code will do that (not on my local machine, though). — ysap, Jul 18 '22 at 18:21
Regarding your "assumption" that STM is faster, have you verified it experimentally? — Nate Eldredge, Jul 18 '22 at 18:27
@NateEldredge - A single `STM` *should* be faster than multiple `STR`s, as it spares the extra fetches, absent a cache or a prefetch unit. Someone else gave a hand-optimized version a try and it looks fine. I am trying to get to the same ballpark with a compiled code. — ysap, Jul 18 '22 at 18:40
OK, removing the extra `volatile` did remove the extra `LDR`. Thanks for the correction. — ysap, Jul 18 '22 at 18:42
just write it in asm as with any other optimization like this — old_timer, Jul 18 '22 at 23:48
@old_timer - thanks for the advice. You can assume that there are reasons why I wouldn't want an assembly implementation. It also serves as a great educational experience. `memcpy()` is not a solution to the problem, because the source address is constant. — ysap, Jul 18 '22 at 23:48
This is an interesting thread. I am trying to force the use of STM in a C++ program using armclang. However, it seems that since C++17, 'register' is deprecated, unused and reserved. An assembler solution seems desirable but I don't know how to do it. Any suggestions? — DavidA, Jul 20 '22 at 13:31
@DavidA - if you want to go the hand-optimized asm route, typically the easiest thing is to start from a disassembled C version. Let `fromelf` get you a clean disassembly source and then change the function body. You then call `armclang` with the `-S` option, instead of `-c`, to assemble your module into an object. — ysap, Jul 20 '22 at 16:15
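For readers who, like DavidA, cannot rely on `register`, one possible middle ground between plain C and a separate assembly file is GNU-style extended inline asm, which GCC and clang-based toolchains accept in both C and C++. The sketch below is an assumption-laden illustration rather than a tested armclang recipe: `burst_read4` and its parameters are hypothetical names, the choice of r3-r6 simply mirrors the compiler output shown in the question, and the non-Thumb branch is a plain-C fallback (no `register`) so the function also builds on a host machine.

```c
#include <stdint.h>

/* Hypothetical helper: for each group, read *src four times and store the
   four words to dst, advancing dst. Returns the advanced dst pointer. */
static uint32_t *burst_read4(volatile const uint32_t *src, uint32_t *dst,
                             unsigned groups)
{
    while (groups--) {
#if defined(__thumb__)
        /* Four loads followed by one STM. r3-r6 are declared as clobbers so
           the compiler keeps its own values (and the operands) out of them.
           The "l" constraint asks for a low register (r0-r7), which the
           16-bit Thumb LDR and STMIA encodings require. */
        __asm__ volatile(
            "ldr   r3, [%[src]]\n\t"
            "ldr   r4, [%[src]]\n\t"
            "ldr   r5, [%[src]]\n\t"
            "ldr   r6, [%[src]]\n\t"
            "stmia %[dst]!, {r3, r4, r5, r6}\n\t"
            : [dst] "+l"(dst)
            : [src] "l"(src)
            : "r3", "r4", "r5", "r6", "memory");
#else
        /* Portable fallback: group the loads before the stores, using
           ordinary locals. Whether a compiler merges the stores into one
           STM here has to be checked in its output. */
        uint32_t a = *src;
        uint32_t b = *src;
        uint32_t c = *src;
        uint32_t d = *src;
        *dst++ = a;
        *dst++ = b;
        *dst++ = c;
        *dst++ = d;
#endif
    }
    return dst;
}
```

With the definitions from the question, `burst_read4((volatile const uint32_t *)port, buf, 4)` would fill all 16 words of `buf`. Hard-coding the scratch registers takes allocation away from the compiler, which is the price of guaranteeing the STM.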
