How to have GCC combine "move r10, r3; store r10" into a "store r3"?

Question

I'm working on Power9 and using the hardware random number generator instruction, DARN. I have the following inline assembly:

uint64_t val;
__asm__ __volatile__ (
    "xor 3,3,3                     \n"  // r3 = 0
    "addi 4,3,-1                   \n"  // r4 = -1, failure
    "1:                            \n"
    ".byte 0xe6, 0x05, 0x61, 0x7c  \n"  // r3 = darn 3, 1
    "cmpd 3,4                      \n"  // r3 == -1?
    "beq 1b                        \n"  // retry on failure
    "mr %0,3                       \n"  // val = r3
    : "=g" (val) : : "r3", "r4", "cc"
);

I had to add a mr %0,3 with "=g" (val) because I could not get GCC to produce expected code with "=r3" (val). Also see Error: matching constraint not valid in output operand.

A disassembly shows:

(gdb) b darn.cpp : 36
(gdb) r v
...

Breakpoint 1, DARN::GenerateBlock (this=<optimized out>,
    output=0x7fffffffd990 "\b", size=0x100) at darn.cpp:77
77              DARN64(output+i*8);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.ppc64le libgcc-4.8.5-28.el7_5.1.ppc64le libstdc++-4.8.5-28.el7_5.1.ppc64le
(gdb) disass
Dump of assembler code for function DARN::GenerateBlock(unsigned char*, unsigned long):
   ...
   0x00000000102442b0 <+48>:    addi    r10,r8,-8
   0x00000000102442b4 <+52>:    rldicl  r10,r10,61,3
   0x00000000102442b8 <+56>:    addi    r10,r10,1
   0x00000000102442bc <+60>:    mtctr   r10
=> 0x00000000102442c0 <+64>:    xor     r3,r3,r3
   0x00000000102442c4 <+68>:    addi    r4,r3,-1
   0x00000000102442c8 <+72>:    darn    r3,1
   0x00000000102442cc <+76>:    cmpd    r3,r4
   0x00000000102442d0 <+80>:    beq     0x102442c8 <DARN::GenerateBlock(unsigned char*, unsigned long)+72>
   0x00000000102442d4 <+84>:    mr      r10,r3
   0x00000000102442d8 <+88>:    stdu    r10,8(r9)

Notice GCC faithfully reproduces the:

0x00000000102442d4 <+84>:    mr      r10,r3
0x00000000102442d8 <+88>:    stdu    r10,8(r9)

How do I get GCC to fold the two instructions into:

0x00000000102442d4 <+84>:    stdu    r3,8(r9)

You have to leave out the `mr` from your inline asm template, and tell gcc that your output is in `r3`. Use `register int foo asm("r3");` if necessary on platforms that don't have specific-register constraints. Gcc will *never* remove text that's part of the asm template; it doesn't even parse it other than substituting in for `%thing`. Or better, don't hard-code a register number. Use an assembler that supports the `DARN` instruction. — Peter Cordes, Nov 27 '18 at 00:29
Also, `xor` is not an efficient way to zero a register on PowerPC. Dependency rules for the PPC asm equivalent of C++11 `std::memory_order_consume` require it to carry a dependency on the input register. xor-zeroing is only a thing on x86, not on fixed-instruction-width ISAs. — Peter Cordes, Nov 27 '18 at 00:33
Thanks Peter. *"Or better... use an assembler that supports the DARN instruction"* - I get what you are saying, but that's mostly impossible. GCC135 on the compile farm is Power9 and 2 months old. It has CentOS 7 and GCC 4.8. The compiler and assembler are a decade behind the support we need. Clang 7.0 also [lacks the support](https://stackoverflow.com/q/53476239/608639) we need. — jww, Nov 27 '18 at 00:37
Regarding XOR not being efficient, what do you suggest? A simple `li 3, 0`? — jww, Nov 27 '18 at 00:39
Yup, exactly. That's what gcc does for `int foo(){return 0;}` https://godbolt.org/z/-gHI4C. Same deal for ARM: mov-immediate. — Peter Cordes, Nov 27 '18 at 00:41

Peter Cordes · Accepted Answer · 2018-11-27T06:39:57.157


GCC will never remove text that's part of the asm template; it doesn't even parse it other than substituting in for %operand. It's literally just a text substitution before the asm is sent to the assembler.

You have to leave out the mr from your inline asm template, and tell gcc that your output is in r3 (or use a memory-destination output operand, but don't do that). If your inline-asm template ever starts or ends with mov instructions, you're usually doing it wrong.

Use register uint64_t foo asm("r3"); to force "=r"(foo) to pick r3 on platforms that don't have specific-register constraints.

(Despite ISO C++17 removing the register keyword, this GNU extension still works with -std=c++17. You can also use register uint64_t foo __asm__("r3"); if you want to avoid the asm keyword. You probably still need to treat register as a reserved word in source that uses this extension; that's fine. ISO C++ removing it from the base language doesn't force implementations to not use it as part of an extension.)

Or better, don't hard-code a register number. Use an assembler that supports the DARN instruction. (But apparently it's so new that even up-to-date clang lacks it, and you'd only want this inline asm as a fallback for gcc too old to support the __builtin_darn() intrinsic.)

Using these constraints will let you remove the register setup, too, and use foo=0 / bar=-1 before the inline asm statement, and use "+r"(foo).

But note that darn's output register is write-only. There's no need to zero r3 first. I found a copy of IBM's POWER ISA instruction set manual that is new enough to include darn here: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf#page=96

In fact, you don't need to loop inside the asm at all, you can leave that to the C and only wrap the one asm instruction, like inline-asm is designed for.

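uint64_t random_asm() {
  register uint64_t val asm("r3");   // pin val to r3, the register encoded in the bytes below
  do {
    // Hand-encoded "darn 3, 1", for assemblers that don't know the mnemonic.
    // These bytes are in big-endian order; on ppc64le use 0xe6, 0x05, 0x61, 0x7c as in the question.
    //__asm__ __volatile__ ("darn 3, 1");
    __asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6  # gcc asm operand = %0\n" : "=r" (val));
  } while(val == -1ULL);             // all-ones is the error value: retry
  return val;
}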

This compiles cleanly (on the Godbolt compiler explorer) to:

random_asm():
.L6:                 # compiler-generated label, no risk of name clashes
    .byte 0x7c, 0x61, 0x05, 0xe6  # gcc asm operand = 3

    cmpdi 7,3,-1     # compare-immediate
    beq 7,.L6
    blr

Just as tight as your loop, with less setup. (Are you sure you even need to zero r3 before the asm instruction?)

This function can inline anywhere you want it to, allowing gcc to emit a store instruction that reads r3 directly.

In practice, you'll want to use a retry counter, as advised in the manual: if the hardware RNG is broken, it might give you failure forever so you should have a fallback to a PRNG. (Same for x86's rdrand)

Deliver A Random Number (darn) - Programming Note

When the error value is obtained, software is expected to repeat the operation. If a non-error value has not been obtained after several attempts, a software random number generation method should be used. The recommended number of attempts may be implementation specific. In the absence of other guidance, ten attempts should be adequate.
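
The retry advice in the programming note could be sketched roughly as follows. This is only an illustration, not code from the manual or from any library: `try_hw_random`, `fill_block`, and `kDarnError` are hypothetical names, and the hardware source is passed in as a callable so the logic can be exercised without a POWER9. On PPC64 the callable would wrap a single `darn`, like the body of the `do` loop in `random_asm()` above, and the fallback would be whatever software generator you already trust.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// With L=1, darn reports failure as all-ones (the loop above compares against -1ULL).
constexpr std::uint64_t kDarnError = ~std::uint64_t{0};

// Hypothetical helper: call `source` at most max_attempts times.
// Returns true and sets `out` on the first non-error value.
template <typename Source>
bool try_hw_random(Source&& source, std::uint64_t& out, int max_attempts = 10) {
    for (int i = 0; i < max_attempts; ++i) {
        const std::uint64_t v = source();
        if (v != kDarnError) {
            out = v;
            return true;
        }
    }
    return false;  // hardware kept failing; the caller should fall back
}

// Hypothetical block fill: each 64-bit word comes from the hardware source,
// or from `fallback` once the retries are exhausted.
// Only whole 8-byte words are written; a trailing partial word is left untouched.
template <typename Source, typename Fallback>
void fill_block(unsigned char* output, std::size_t size,
                Source&& source, Fallback&& fallback) {
    for (std::size_t i = 0; i + 8 <= size; i += 8) {
        std::uint64_t v;
        if (!try_hw_random(source, v))
            v = fallback();
        std::memcpy(output + i, &v, sizeof v);
    }
}
```

The default of ten attempts is taken from the note's "ten attempts should be adequate"; keeping the retry loop in C++ rather than in the asm template also means the compiler sees the counter and the comparison.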

xor-zeroing is not efficient on most fixed-instruction-width ISAs: a mov-immediate is just as short, so there's no need to detect and special-case an xor, and CPU designs don't spend transistors on it. Moreover, the dependency rules for the PPC asm equivalent of C++11 std::memory_order_consume require xor to carry a dependency on its input register, so it couldn't be dependency-breaking even if the designers wanted it to be. xor-zeroing is only a thing on x86 and maybe a few other variable-width ISAs.

Use li r3, 0 like gcc does for int foo(){return 0;} https://godbolt.org/z/-gHI4C.

edited Nov 27 '18 at 06:39

answered Nov 27 '18 at 00:56

Peter Cordes


Do you also recommend `li 4, -1` instead of `addi 4,3,-1`? – prl Nov 27 '18 at 02:03
@prl - Yeah, I already made that change. I got bogged down in the original problem, which was how to load the constant `0xffffffff ffffffff` into a register. A `0-1` seemed like a good choice until I learned I could load the immediate -1. – jww Nov 27 '18 at 02:11
Now open in the LLVM bug tracker: Issue Bug 39800, [Clang 7.0 missing Power9 __builtin_darn and friends](https://bugs.llvm.org/show_bug.cgi?id=39800) – jww Nov 27 '18 at 02:12
@prl: yes, of course. But really I recommend letting the compiler do it by using a `[minus_one] "ri"(-1LL)` input constraint, and a `cmp 0,0,3,%[minus_one]` in the asm. You only need to force the register allocation for the one register involved in the instruction you're hand-encoding. – Peter Cordes Nov 27 '18 at 02:16
I thought I would try removing `__volatile__` and give GCC an opportunity to move things around. GCC removed most of the code so the buffers were not random. The GCC inline assembler absolutely sucks. I would be embarrassed if I published a tool that sucked this bad. Not only we cannot express what we want to do, we have to use volatile to keep GCC from breaking shit. – jww Nov 27 '18 at 03:12
@jww: what the hell are you talking about? If you remove `__volatile__`, then it can CSE and assume that every run of the asm block with the same inputs will produce the same outputs. Enabling that kind of optimization is the whole point of non-`volatile` asm: don't use it when your asm is not a pure function of the inputs, like for a timestamp or RNG instruction. What exactly do you think you can't express here? I mean yes the builtin/intrinsic would be even better, but with proper constraints the compiler should be able to optimize pretty well. – Peter Cordes Nov 27 '18 at 04:10
*"What exactly do you think you can't express here?"* - I can't connect the output of `darn` with `r3` without a useless intermediate move. `=r3 (val)` or `=3 (val)` succinctly expresses what I want to do, but I can't say it because of this broken tool. If I could succinctly express it, then GCC would know not to remove the block because the data mattered. – jww Nov 27 '18 at 04:24
And I'm back to this problem on AIX: [1252-142 Syntax error from AIX assembler due to local label](https://stackoverflow.com/q/51869790/608639). GCC inline assembly absolutely blows. – jww Nov 27 '18 at 04:50
@jww: You can get the equivalent of `"=r3"` using a register local like I said in my answer, and like David commented on your duplicate of this. It's somewhat cumbersome compared to a specific-register constraint, but it does completely solve the efficiency problem you're complaining about. Updated my answer with a working example, but I'd suggest reading more carefully in future. You frequently seem to be missing / not taking in things that people are telling you, (or that documentation says). – Peter Cordes Nov 27 '18 at 06:00
I work in C++. `register` keyword is going away. Sorry if I don't always point out the minutiae for you. – jww Nov 27 '18 at 06:11
@jww: `register asm` local variables are *not* going away in GNU C++. See g++8.2 for x86, with `-std=c++17`, where `register int x;` is not allowed by ISO C++17. https://godbolt.org/z/_YLfEs G++ warns for it, but *not* for `register int x __asm__("eax")`, proving that GNU C and C++ will continue to support the GNU extension regardless of the language removing the `register` keyword from the base language. – Peter Cordes Nov 27 '18 at 06:34
We cut this in at [Commit 3db34abf2f9e](https://github.com/weidai11/cryptopp/commit/3db34abf2f9e). Thanks for your help. – jww Nov 27 '18 at 08:48
@jww: glad I could help, but your code looks really inefficient. Why put DARN behind a non-inline function in a `.cpp` file? (Unless everyone always compiles the library with LTO) And why return an output by pointer arg instead of as a return value? Is `register uint64_t val asm("r3");` not portable to IBM's XLC compiler? I see you're still using an inefficient `mr` for no apparent reason. – Peter Cordes Nov 27 '18 at 09:13
The short answer to non-inlining is, x86_64 is the exception and not the rule. ARM, Aarch64, MIPS and PowerPC require compiler options to activate ISAs. If higher ISAs are used in header files, then users have to do things like `-mcpu=power9` in every source file that includes the header. That's a headache I don't want. It will generate countless bug reports and mailing list messages. – jww Nov 27 '18 at 09:29
You might also be interested in [BASE+SIMD](https://www.cryptopp.com/wiki/BASE+SIMD) on the Crypto++ wiki. It discusses the problem in more detail. It also says, *"All of this trouble [with ISAs and compiler options] would have been avoided if the compilers simply made the instructions available out of the box for user code"*. GCC and x86_64 do that for lower ISAs, but that's the only platform I am aware of. – jww Nov 27 '18 at 09:36
@jww: oh right, that makes sense for calls from outside the library, and allows runtime dispatching I guess. If there aren't any callers from inside the library, or the library itself is always built with LTO, then you're fine, otherwise you might want an internal inline version, and an externally-visible non-inline wrapper that just *uses* the inline definition. – Peter Cordes Nov 27 '18 at 09:37
@jww: If you do have functions that only run after checking for CPU support at runtime, you can use `__attribute__((target("power9-misc")))` on them https://gcc.gnu.org/onlinedocs/gcc/PowerPC-Function-Attributes.html#PowerPC-Function-Attributes. But they can't inline into functions that *don't* have that attribute. Anyway, https://godbolt.org/z/Zd3i6P shows how that lets a function using `__builtin_darn()` compile without any `-m` options with PPC64 GCC6.3. Unless there's some portability problem with that (to clang or XLC), that should solve another one of your gcc complaints. – Peter Cordes Nov 27 '18 at 10:11
But with inline asm, you don't even need that. One of the few benefits of inline asm is being able to insert instructions that aren't enabled for use by the compiler. You should be able to `#ifdef` the version that uses a builtin, and only use inline asm in a header. And BTW, you could simplify the big/little-endian versions of the `.byte` encoding by using a macro, instead of copying it to multiple asm blocks. Or can you simply use `.word xxxxx` and let the assembler pick the right endianness? – Peter Cordes Nov 27 '18 at 10:13
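
A rough sketch of that last suggestion, under stated assumptions: the two `.byte` sequences in this thread are the same 32-bit word, `0x7c6105e6`, written in opposite byte orders, so emitting the whole word leaves the byte order to the assembler. The sketch uses `.long` rather than `.word` on the assumption that the assembler treats `.long` as a 32-bit value in the target's byte order (GNU as does; this has not been checked against the AIX assembler). `darn_word`, `DARN_R3_L1`, and `darn64_once` are hypothetical names, and the field positions are the ones that word decodes to: primary opcode 31, then RT, then L, then extended opcode 755.

```cpp
#include <cstdint>

// Hypothetical helper: build the instruction word for "darn RT, L".
// 31 << 26 is the primary opcode, 755 << 1 the extended opcode.
constexpr std::uint32_t darn_word(unsigned rt, unsigned l) {
    return (31u << 26) | ((rt & 31u) << 21) | ((l & 3u) << 16) | (755u << 1);
}

// "darn 3, 1" is the word whose bytes appear in both orders in this thread:
// 7c 61 05 e6 (big-endian) and e6 05 61 7c (little-endian).
static_assert(darn_word(3, 1) == 0x7c6105e6u, "encoding of darn 3, 1");

// One definition for both endiannesses: the assembler picks the byte order.
#define DARN_R3_L1 ".long 0x7c6105e6"

#if defined(__powerpc64__)
// Only compiled on PPC64; relies on the GNU register-asm extension.
inline std::uint64_t darn64_once() {
    register std::uint64_t val asm("r3");   // pin the output to r3, as in the answer
    __asm__ __volatile__ (DARN_R3_L1 : "=r" (val));
    return val;
}
#endif
```

The `static_assert` ties the macro's constant to the field arithmetic, so a different register or `L` value can be checked the same way before it is pasted into a template.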
