
I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly, in a .c file compiled with ARM Compiler 5 (armcc).

#include <stdint.h>

/* Intended to emit a single two-word burst: store w0, then w1,
   at addr, with post-increment writeback. */
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

But the ARM Compiler armcc User Guide, paragraph 7.18, says: "All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."

And that is what really happens in practice: LDM/STM are expanded into a set of LDR/STR in some cases, and the order of these instructions is arbitrary. This hurts performance, since the hardware we use is optimized for burst processing. It also breaks functional correctness, because our hardware takes the sequence of words into account and ignores offsets (while the compiler thinks it is safe to reorder the instructions).

To work around this, it's possible to use the embedded assembler instead of the inline assembler, but that introduces extra function call/return overhead, which hurts performance.
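
For reference, here is a minimal sketch of that embedded-assembler variant (the name STMIA2_embedded is mine; armcc builds an __asm function as a real, out-of-line function, so the AAPCS pins addr to r0, w0 to r1 and w1 to r2, and the register list cannot be rescheduled):

__asm void STMIA2_embedded(uint32_t addr, uint32_t w0, uint32_t w1)
{
    STMIA r0!, {r1, r2}   ; r1 (w0) is written first, then r2 (w1)
    BX    lr              ; the extra call/return overhead mentioned above
}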

So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.

Target CPU: Cortex-M0+ (ARMv6-M).

Edit: the slave devices are all on-chip, and most of them are non-memory devices. For every register of a non-memory slave that supports burst access, a region of the address space is reserved (for example [0x10000..0x10100]). I'm not completely sure why; maybe the CPU or the bus doesn't support fixed (non-incrementing) addresses. The hardware ignores offsets within this region. A full request can be, for example, 16 bytes, and the first word of the full request is the first word written (even if its offset is non-zero).

imiron13
  • If you care that much about performance, then write more of what you need in a separate assembler file. A single instruction in an inline C function won't get you much, considering how badly the compiler handles the rest of your code. My operating principle is always: if you care about the performance of a time-critical routine, write it yourself (in assembler). – BitBank Oct 17 '15 at 06:19
  • @imiron13: I suspect that you are screwed. Keil inline assembly lets the optimizer out of the bag and lacks fine-grained control over what it "optimizes". How bad is the code generation if you use normal volatile pointers to ensure write order, with 64-bit types in an attempt to combine the writes? – doynax Oct 17 '15 at 06:26
  • @BitBank: My assumption would be that the performance hit is not isolated to a single critical inner loop, which could easily be hand-tuned, but that the writes are generated inline throughout significant segments of the codebase. – doynax Oct 17 '15 at 06:28
  • @BitBank: the case is as doynax described; most of the HW is optimized to work with bursts, and it's not reasonable to implement the whole HAL layer in assembly. – imiron13 Oct 17 '15 at 07:18
  • @doynax: I didn't try to dereference a (volatile uint64_t*) yet; maybe it will produce LDM/STM, but 16-64 byte bursts are also required in other cases, so it seems impossible to express this in C code. – imiron13 Oct 17 '15 at 07:23
  • @imiron13: You can always try bunching things up and using structure assignments (see the sketch after these comments). Still, even if you succeed, relying on the optimizer to reliably generate specific code is always a wing-and-a-prayer thing to do. – doynax Oct 17 '15 at 07:25
  • @doynax: Thanks for the advice about structures, I will try that. I also agree that this is not reliable. I thought maybe some keyword exists, like "don't touch my assembly".. – imiron13 Oct 17 '15 at 07:34
  • "Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions)." It sounds like your hardware is broken if it can't handle correct sequences of instructions. This is your biggest problem as you can NEVER trust the hardware and the compiler to do the "right" thing. – Russ Schultz Oct 17 '15 at 12:28
  • FWIW, dereferencing a uint64_t* (or accessing a uint64_t at all) generates an LDM/STM for me. I've worked on a part that had a peripheral bridge that couldn't handle burst transactions, so we couldn't ever do 64-bit reads/writes to it. Luckily it wasn't the memory bridge... – Russ Schultz Oct 17 '15 at 12:30
  • @RussSchultz: strange. I'd expect it to use `ldrd` for that, because it is more flexible with the registers. I will check my own code (gcc). – too honest for this site Oct 17 '15 at 14:32
  • The Cortex-M0+ core itself [is only capable of non-sequential transactions](http://infocenter.arm.com/help/topic/com.arm.doc.ddi0484c/Babbeicd.html), so the whole issue of getting one to emit bursts seems moot anyway... – Notlikethat Oct 17 '15 at 18:04
  • @Olaf, Notlikethat: Sorry, I might be wrong about the AXI bus here, maybe it's some other bus, but I'm pretty sure about the Cortex-M0+ (not sure if it is customized in any way). And our design states that bursts need to be used, so I suppose the bus supports bursts (I trust our HW designer). – imiron13 Oct 17 '15 at 18:29
  • Is that for on-chip resources or an external device? It sounds uncommon to do such optimisations for a CM0. The exception is the MPU, but you need assembler there anyway. Anyway, I would not use a compiler which fiddles with inline-assembler code. Why not use gcc? That does not have such attitudes.. – too honest for this site Oct 17 '15 at 19:06
  • @Olaf: on-chip, no MPU. Maybe armcc was chosen because it's commercial and can have better optimizations, but this is not finalized and we can switch to gcc. But as artless noise explained (see the link in his answer), gcc also has issues with LDM/STM expansion. – imiron13 Oct 17 '15 at 19:44
  • @Olaf: GCC isn't as good at size optimization as armcc or Keil. Sometimes that trumps all. – Russ Schultz Oct 17 '15 at 19:56
  • @RussSchultz: Which gcc version are you referring to? AFAIK, especially for the Cortex-M, it has evolved very much since first support. gcc 4.9.3 (not sure if 4.9 already did) added options to optimise for slow Flash, for instance (reducing usage of literal pools). But I'm not sure about the M0; I've just used it for M3/4 (and hopefully soon for M7). Also, size seems to be less of a concern for the OP; he seems to be after speed. – too honest for this site Oct 17 '15 at 19:58
  • The last time I did a size comparison (a few years ago), gcc was still lagging behind. It had gotten better since the previous time (about 10 years ago, with ARM9/Thumb), but it was still not on par. – Russ Schultz Oct 17 '15 at 20:03
  • Thanks for the opinions on armcc vs gcc; this is useful since we have not finalized the choice yet. @Olaf: size is also very critical... maybe even more than performance; we are going to use -Os. – imiron13 Oct 17 '15 at 20:10
  • Anyway, before doing premature optimisations, you should profile readable and working code. It is nonsense to expect the compiler to generate a specific kind of code. For instance: the compiler might (actually, it likely will) inline the function and use completely different registers for the arguments, making `stm` impractical (which might be the reason it behaves as it does). This can easily be one reason gcc generates more code. In general, there is always a trade-off between speed and size. – too honest for this site Oct 17 '15 at 20:15
  • After the edit: I know such designs. Actually, the MPU (of the M3/4/7, not sure about the M0) uses a similar approach to allow setting up to four regions with a single instruction. The recommendation is to provide two specialised assembler functions (read/write) to move such a block correctly. You should not rely on the compiler here. As armcc seems to reorder even inline assembler, you have to use a separate compilation unit and hope there is no LTO. – too honest for this site Oct 18 '15 at 14:57
  • @RussSchultz: a few years ago is a very long time. Cortex-M0 support has not been in gcc for that long, and quite a few improvements have been made to gcc to support the ARM CPUs - with active help from ARM. Note that the Cortex-M are Thumb2-only, and that is different in some aspects from Thumb1 or ARM-"native". You really should re-evaluate with 4.9.3 or a newer gcc. Until then, I'd be very careful with such statements. – too honest for this site Oct 18 '15 at 15:01
  • I evaluated the Cortex-M3, not the M0. ARM gave me the same spiel at the time that you're giving me now. Another colleague of mine recently did a comparison of gcc vs Keil, due to a push by management to use our public tools on our internal products. The same answer came back: the commercial tools are better at the metrics we cared about. I'm comfortable making the statements I make based on my measured results. – Russ Schultz Oct 18 '15 at 19:14
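
To illustrate the C-level ideas from the comments above, here is a sketch only: the 0x10000 window is taken from the question's edit, the names are hypothetical, and nothing guarantees that the optimizer actually emits LDM/STM.

#include <stdint.h>

typedef struct { uint32_t w[4]; } burst16_t;          /* one 16-byte request */

#define SLAVE_WIN64 ((volatile uint64_t *)0x10000)    /* hypothetical window */
#define SLAVE_WIN16 ((volatile burst16_t *)0x10000)

/* 8-byte case: a single volatile 64-bit store that the compiler may
   lower to an STM (or separate STRs); little-endian ARM stores w0 first. */
void send8(uint32_t w0, uint32_t w1)
{
    *SLAVE_WIN64 = ((uint64_t)w1 << 32) | w0;
}

/* 16-byte case: a struct assignment the optimizer may combine into
   LDM/STM - neither form is guaranteed. */
void send16(const burst16_t *req)
{
    *SLAVE_WIN16 = *req;
}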

1 Answer


So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.

A little bit about compiler optimizations. Register allocation is one of a compiler's toughest jobs, and the heart of any compiler's code generation is arguably the point where it allocates physical CPU registers. Most compilers use static single assignment (SSA) to rename your 'C' variables into a set of pseudo-variables (time-ordered versions of each variable).

In order for your STMIA and LDMIA to work, you need the loads and stores to be consistent. I.e., if the store is stmia [rx], {r3,r7} and the restore is ldmia [rx], {r4,r8}, then the stored 'r3' must map to the new 'r4' and the stored 'r7' to the restored 'r8'. This is not simple for any compiler to implement generically, as 'C' variables are assigned to registers according to need, and different versions of the same variable may live in different registers. To make the stm/ldm work, those variables must be assigned so that the register numbers increase in the right order. I.e., for the ldmia above, if the compiler wants the stored r7 in r0 (maybe as a return value?), there is no way for it to create a good ldm instruction without generating additional code.

You may have gotten gcc to generate this, but it was probably luck. If you proceed with only gcc, you will probably find it doesn't work as well.

See: ldm/stm and gcc for issues with GCC stm/ldm.
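
For comparison, here is a minimal sketch of the register-pinning approach that tends to be needed with GCC (GCC-specific syntax; the function name is illustrative, and armcc does not accept these constraints):

#include <stdint.h>

static inline void stmia2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    register uint32_t a  __asm__("r0") = addr;
    register uint32_t v0 __asm__("r1") = w0;
    register uint32_t v1 __asm__("r2") = w1;

    /* Pinning the operands keeps the register list {r1, r2} ascending
       and valid; "memory" stops the compiler from reordering the store. */
    __asm__ volatile ("stmia %0!, {%1, %2}"
                      : "+r" (a)
                      : "r" (v0), "r" (v1)
                      : "memory");
}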

Taking your example,

inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

The value of inline is that the whole function body may be placed right in the calling code. The caller might have w0 and w1 in registers R8 and R4. If the function is not inlined, then the compiler must place them in R1 and R2 (per the AAPCS), which may generate extra moves. It is difficult for any compiler to fulfil the requirements of ldm/stm generically.

This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).

If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality for writing to this slave in an external function and force the register allocation (see the AAPCS) so that ldm/stm will work. This results in a performance hit, which could be mitigated by some custom assembler in the driver for the device.
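
For example, a sketch of such a wrapper using the armcc embedded assembler (the function name and the 16-byte transfer size are illustrative; per the AAPCS, dst arrives in r0 and src in r1, so the register lists are fixed):

__asm void slave_write16(volatile void *dst, const uint32_t *src)
{
    PUSH  {r4, r5}                ; r4-r5 are callee-saved (AAPCS)
    LDMIA r1!, {r2, r3, r4, r5}   ; fetch the four request words
    STMIA r0!, {r2, r3, r4, r5}   ; one 4-word burst, words in order
    POP   {r4, r5}
    BX    lr
}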

However, it sounds like the device might be memory? In that case, you have a problem. Normally, memory devices like this are only accessed through a cache. If your CPU has an MPU (memory protection unit) and can enable both data and code caches, then you might resolve this issue: cache line transfers are always burst accesses. Care only needs to be taken in the code that sets up the MPU and the data cache. The OP's Cortex-M0+ has no cache and the devices are non-memory, so this is neither possible nor needed.

If your device is memory and you have no data cache, then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit, losing the benefit of random access to the memory device.

artless noise
  • (This is more related to the question you've linked in your post, but I cannot comment there.) Does 'issues with gcc' mean that it's possible to generate bursts there reliably, but the result may be suboptimal due to the explicit register assignments? – imiron13 Oct 17 '15 at 22:47
  • Well, your mileage may vary depending on the GCC version and option flags. I found that the only reliable way was to either use a function call and rely on the register arguments, OR specify GCC register `asm` variables. If you don't, you may get mysterious assembler messages about registers being out of order for stm/ldm. I.e., someone changes the code or the compiler options, and GCC chooses registers that don't work. – artless noise Oct 17 '15 at 22:50
  • If this is all 'on chip' and your company hasn't designed it, but some large corporation has, and the chipset is widely used, then I suspect there is some sort of bus configuration needed. It is hard to believe that a commercial SOC would have the problems you describe in your edit; anyone using the chip would have the same problems. If the chip is a sample of a future product, then this might be an errata item, and I would expect it to be fixed before the chip goes to wide production. – artless noise Oct 18 '15 at 17:47