ARM inline assembly multi-precision multiplication

Question

I am new to ARM assembly, and I want to implement one of my C functions in inline assembly. My function is a multi-precision multiplication that multiplies a 32-bit unsigned integer by a 256-bit unsigned integer and puts the result into a 288-bit unsigned integer data type. I defined my data types as:

typedef struct UN_256fe{
    uint32_t uint32[8];
}UN_256fe;

typedef struct UN_288bite{
    uint32_t uint32[9];
}UN_288bite;

and here is my function:

void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res){
uint32_t temp;
asm (    "umull         %0, %1, %9, %10;\n\t"
         "umull         %18, %2, %9, %11;\n\t"
         "adds          %1, %18, %1;    \n\t"
         "umull         %18, %3, %9, %12;\n\t"
         "adcs          %2, %18, %2;    \n\t"
         "umull         %18, %4, %9, %13;\n\t"
         "adcs          %3, %18, %3;    \n\t"
         "umull         %18, %5, %9, %14;\n\t"
         "adcs          %4, %18, %4;    \n\t"
         "umull         %18, %6, %9, %15;\n\t"
         "adcs          %5, %18, %5;    \n\t"
         "umull         %18, %7, %9, %16;\n\t"
         "adcs          %6, %18, %6;    \n\t"
         "umull         %18, %8, %9, %17;\n\t"
         "adcs          %7, %18, %7;    \n\t"
         "adc           %8, %8, 0 ;     \n\t"

         : "=r"(res->uint32[8]), "=r"(res->uint32[7]), "=r"(res->uint32[6]), "=r"(res->uint32[5]), "=r"(res->uint32[4]),
           "=r"(res->uint32[3]), "=r"(res->uint32[2]), "=r"(res->uint32[1]), "=r"(res->uint32[0])
         : "r"(A), "r"(B->uint32[7]), "r"(B->uint32[6]), "r"(B->uint32[5]),
           "r"(B->uint32[4]), "r"(B->uint32[3]), "r"(B->uint32[2]), "r"(B->uint32[1]), "r"(B->uint32[0]), "r"(temp));

}

It seems fine to me, but when I debug the code, for example right after executing the first instruction, "umull %0, %1, %9, %10;\n\t", I have:

(gdb) p/x A //-->%9
$8 = 0x1
(gdb) p/x B->uint32[7] //-->%10
$9 = 0xffffff1
(gdb) p/x res->uint32[8] //-->%0
$10 = 0x1
(gdb) p/x res->uint32[7] //-->%1
$11 = 0x0

It seems that I made some mistakes in my assembly instructions. Can anyone explain it to me?

I'm not sure what your problem is, but none of the values in the `uint32` array will change after executing the first instruction in the asm statement. Because the asm does everything in registers, the values in those registers are not stored to the array until some time after all of the instructions in the statement have executed. Exactly when depends on the code the compiler emits after the asm statement to do these stores. — Ross Ridge, Dec 13 '15 at 20:47
Also, your output operands need to be marked with early-clobber constraints (`=&r`) so the compiler knows not to assign their registers to input operands as well. The `temp` operand should be an early-clobber output operand so the compiler knows the asm statement modifies the register. — Ross Ridge, Dec 13 '15 at 20:53
@RossRidge In the first line, what I expect is that A (%9) is multiplied by B->uint32[7] (%10) and the result is stored in res->uint32[8] (%0) and res->uint32[7] (%1). So, based on what you said, the result is stored in a register, but why? I have already defined my destination. Should I add some other assembly lines to store my results in res->uint32[7] and res->uint32[8]? As I said, I am new to ARM assembly, but I did the same procedure in PTX and it worked! — A23149577, Dec 13 '15 at 20:56
It's stored in a register because the constraints for `%0` and `%1` use `r`, which restricts the operand to a register. If you used `m` instead, the compiler would use a memory operand, but the assembler would then reject the statement because the UMULL instruction doesn't take memory operands. You don't need to add statements to do the stores yourself. As I said, the compiler will automatically emit the necessary code (assembly instructions) to do these stores after your asm statement. — Ross Ridge, Dec 13 '15 at 21:13
@RossRidge I am sorry, but I don't get it. When I replace `=r` with `=&r`, the compiler gives this error: `can’t find a register in class ‘GENERAL_REGS’ while reloading ‘asm’`. I think it relates to the number of general registers in my ARM processor, and it seems that I use too many registers in my code. Would you please give me an example of inline assembly code that gets two `uint32_t` operands from memory, multiplies them, and stores the 64-bit result into memory? — A23149577, Dec 13 '15 at 21:54
The fact that you're getting that error makes sense. You've got 19 operands that all need to be given different registers or the asm statement won't work correctly, but there are at most 15 registers the compiler can use. You don't need an asm statement to do that; just use `uint64_t n = (uint64_t) A * B->uint32[7]`. — Ross Ridge, Dec 13 '15 at 22:07
@artlessnoise: nice find to help the OP with the follow-on problem, but this question is still about where the output values will be while single-stepping the asm. (And I'd suggest looking at the disassembly of the final result while debugging, that'll make it clear what the compiler did for data movement, and that the outputs will still be in regs until the stores happen.) To work around this, you could maybe give the inline asm some memory outputs, and store to them. Or pass addresses into the inline asm and do the loads / stores too, to overcome the register limitations. — Peter Cordes, Dec 15 '15 at 00:14
@artlessnoise: yeah, fair enough, closing as a dup is a good way to get it out of the system. Any direct answer to this question isn't going to add anything that isn't in comments already, and it's prob. not going to help anyone else anyway. Splitting up the inline asm is a terrible idea, though. Compilers are free to stick an instruction that modifies the flags in between any `asm` blocks. And with this code, that could well happen with instructions to compute addresses to set up inputs/outputs. (And the weather wasn't too bad. Pretty poor driving on the way back from PEI, though.) — Peter Cordes, Dec 15 '15 at 03:42
