`ldm/stm` in gcc inline ARM assembly

Question

I am trying to emit an ldm (or stm) instruction from inline assembly, but I am having trouble expressing the operands, in particular their order.

A trivial

void *ptr;
unsigned int a;
unsigned int b;

__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));

does not work because it might put a into r1 and b into r0:

ldm ip!, {r1, r0}

ldm expects registers in ascending order (as they are encoded in a bitfield), so I need a way to say that the register used for a is lower than that of b.

A trivial way is the fixed assignment of registers:

register unsigned int a asm("r0");
register unsigned int b asm("r1");

__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));

But this removes a lot of flexibility and might make the generated code suboptimal.

Does gcc (4.8) support special constraints for ldm/stm? Or, are there better ways to solve this (e.g. some __builtin function)?

EDIT:

Because there are recommendations to use "higher level" constructs... The problem I want to solve is packing of 20 bits of a 32 bit word (e.g. input is 8 words, output is 5 words). Pseudo code is

asm("ldm  %[in]!,{ %[a],%[b],%[c],%[d] }" ...)
asm("ldm  %[in]!,{ %[e],%[f],%[g],%[h] }" ...) /* splitting of ldm generates better code;
                                                  gcc gets out of registers else */
/* do some arithmetic on a - h */

asm volatile("stm  %[out]!,{ %[a],%[b],%[c],%[d],%[e] }" ...)

Speed matters here and ldm is 50% faster than ldr. The arithmetic is tricky, and because gcc generates much better code than me ;) I would like to solve it with inline assembly that gives the compiler some hints about optimized memory access.

Have you seen this? http://gcc.gnu.org/ml/gcc-help/2007-04/msg00092.html — auselen, Dec 17 '13 at 11:10
@auselen thx for the link; it is exactly the problem I am describing. But post is from 2007 and perhaps something has been changed since then? — ensc, Dec 17 '13 at 11:13
I bet not. "it would require at least a partial rewrite of gcc's register allocator." http://gcc.gnu.org/ml/gcc-help/2007-04/msg00109.html — auselen, Dec 17 '13 at 11:23
Best would be to solve your problem at a higher level and be register agnostic. — auselen, Dec 17 '13 at 11:28

score 1 · Answer 1 · edited May 23 '17 at 12:19

I have recommended the same solution in ARM memtest, i.e., explicitly assign the registers. The analysis on gcc-help is wrong. There is no need to rewrite GCC's register allocation. The only thing that is needed is to allow the ordering of registers in an assembler specification.

That said, consider the following, which assigns the registers in descending order:

int main(void)
{
    void *ptr;
    register unsigned int a __asm__("r1");
    register unsigned int b __asm__("r0");

    __asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
    return 0;
}

This will not compile with my gcc, because it produces an illegal ARM instruction, ldm r3!,{r1,r0}. One solution is to use the -S flag to stop after compilation (producing assembly source without assembling it) and then run a script that reorders the ldm/stm operands. Perl can easily do this with,

$reglist = join(',', sort(split(',', $reglist)));

Any other tool would do as well. Unfortunately, there doesn't appear to be any way to do this using assembler constraints. If we had access to the assigned register number, inline alternatives or conditional compilation could be used.
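One caveat with a plain string sort is that it orders r10 before r2. Below is a minimal sketch in C of the same post-processing step, assuming only the plain names r0 to r15 appear in the list (aliases such as sp or lr are rejected). sort_reglist is a hypothetical helper, not part of any toolchain. It collects the registers in a bitmask, mirroring how the instruction encodes them, and writes them back in ascending numeric order.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: rewrite a register list such as "r10,r2,r0" in
 * ascending numeric order. Only r0..r15 are accepted (an assumption).
 * Returns 0 on success, -1 on a parse error or if dst is too small. */
int sort_reglist(const char *src, char *dst, size_t dst_len)
{
    unsigned mask = 0;      /* bit n set means rn is in the list */
    size_t used = 0;        /* characters written to dst so far */
    const char *p = src;

    assert(src != NULL && dst != NULL);
    if (dst_len == 0)
        return -1;

    for (;;) {
        char *end;
        long n;

        p += strspn(p, " ,");           /* skip separators */
        if (*p == '\0')
            break;
        if (p[0] != 'r' || !isdigit((unsigned char)p[1]))
            return -1;
        n = strtol(p + 1, &end, 10);
        if (n > 15)
            return -1;
        mask |= 1u << n;
        p = end;
    }

    dst[0] = '\0';
    for (int n = 0; n < 16; n++) {
        int w;

        if (!(mask & (1u << n)))
            continue;
        w = snprintf(dst + used, dst_len - used, "%sr%d", used ? "," : "", n);
        if (w < 0 || (size_t)w >= dst_len - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

Note that reordering the list also changes which operand receives which word, so this only helps when the code after the ldm does not depend on the original order.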

Probably the easiest solution is to use explicit register assignment, unless you are writing a vector library that needs to load/store multiple values and you want to give the compiler some freedom to generate better code. In that case, it is probably better to use structures, as the higher-level gcc optimizations will be able to detect unneeded operations (such as multiplication by one or addition of zero).
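A minimal sketch of the structure approach, assuming four consecutive words. struct quad and sum_quad are hypothetical names used only for illustration. Copying the whole structure into a local leaves the choice of registers and load instruction(s) to the compiler. Whether gcc actually emits an ldm depends on the version, target and optimization flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: four consecutive 32-bit words. */
struct quad {
    uint32_t a, b, c, d;
};

/* Load all four words at once and do some arithmetic on them.
 * The compiler picks the registers, so no ordering constraint is needed. */
uint32_t sum_quad(const struct quad *p)
{
    struct quad v;

    assert(p != NULL);
    v = *p;                 /* one structure copy instead of four loads */
    return v.a + v.b + v.c + v.d;
}
```

Inspecting the output of gcc -O2 -S for the target in question shows which load instructions were actually chosen.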

Edit:

Because there are recommendations to use "higher level" constructs... The problem I want to solve is packing of 20 bits of a 32 bit word (e.g. input is 8 words, output is 5 words).

This will probably give better results,

  u32 *ip, *op;        /* input and output pointers; must be set before use */
  u32 in, out, mask;
  int shift = 0;
  const u32 *op_end = op + 5;

  while (op != op_end) {
     in = *ip++;
     /* mask and accumulate... */
     if (shift >= 32) {
       *op++ = out;
       shift -= 32;
     }
  }

The reasoning is that the ARM pipeline generally has several stages, with a separate load/store unit. The ALU (arithmetic) may proceed in parallel with the load and the store, so you can be working on the first word while you are loading later words. In this case, you may also replace the values in place, which will give a cache benefit, unless you need to re-use the 20-bit values. Once the code is in the cache, ldm/stm has little benefit if you stall on data. That will be your case.
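For illustration, a filled-in plain C version of such a loop might look like the following. It is a sketch under assumptions the question does not state: the low 20 bits of each input word are kept, and they are packed least-significant-bit first. pack20 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack the low 20 bits of each of n_in words into consecutive 32-bit
 * words. The bit order (LSB first) is an assumption. n_in must be a
 * multiple of 8 so that the output ends on a word boundary.
 * Returns the number of words written to out. */
size_t pack20(const uint32_t *in, size_t n_in, uint32_t *out)
{
    uint64_t acc = 0;       /* pending bits not yet written out */
    unsigned bits = 0;      /* number of valid bits in acc (always < 32 here) */
    size_t n_out = 0;

    assert(n_in % 8 == 0);

    for (size_t i = 0; i < n_in; i++) {
        /* mask and accumulate */
        acc |= (uint64_t)(in[i] & 0xFFFFFu) << bits;
        bits += 20;
        if (bits >= 32) {
            out[n_out++] = (uint32_t)acc;   /* emit one full word */
            acc >>= 32;
            bits -= 32;
        }
    }
    return n_out;
}
```

With 8 input words this writes exactly 5 output words, and the loads, arithmetic and stores are left for the compiler to schedule.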

2nd Edit: The main job of a compiler is to avoid loading values from memory; i.e., register assignment is crucial. Generally, ldm/stm are most useful in memory transfer functions, e.g. a memory test or a memcpy() implementation. If you are doing computation with the data, then the compiler may have better knowledge about pipeline scheduling. You probably need to either accept plain 'C' code or move to complete assembler. Remember, ldm has the first operands available to use immediately. Use of the ALU with subsequent registers can cause a stall while the data loads. Similarly, stm needs the calculations for the first register to be complete when it executes, but this is less critical.

Btw, a [new register allocator](http://gcc.gnu.org/gcc-4.8/changes.html) has been written for gcc-4.8, but only for IA-32 and x86-64. *A new local register allocator (LRA) has been implemented, which replaces the 26 year old reload pass and improves generated code quality. For now it is active on the IA-32 and x86-64 targets.* — artless noise, Dec 17 '13 at 17:35
The preprocessing to generate a correct list of registers will not work in my case because the arithmetic is sensitive to the order of read data. — ensc, Dec 17 '13 at 19:24
You must use the `register u32 a asm("rX");` in this case. Especially with 8 values; your original question only had two. With this many registers, you are going to spill anyway. Also, you may be better off interleaving read/process/write. The load/store is a separate unit, so this may actually be faster. `ldm/stm` will not pipeline. — artless noise, Dec 17 '13 at 20:00
artless, register is only a hint keyword which only provides advice, similar to the inline keyword. It is not guaranteed that your variable will actually end up in a hardware register (unless of course you are already performing an operation on it). — sgupta, Dec 28 '13 at 08:00
@user1075375 wrong; `register` is required for this kind of annotation. — ensc, Dec 28 '13 at 17:40
@ensc, I'm not saying register is required or not. I'm just saying that the register keyword by definition is merely a hint. It's up to the compiler to actually fulfill that request or not, just like inline. — sgupta, Dec 28 '13 at 18:16
@user1075375 This [qemu issue](https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg04675.html) may help with your deleted question. There was a bug with `RFE` handling and QEMU. — artless noise, Jan 24 '14 at 19:52
