I have a problem with inline assembly on AArch64 Linux, using GCC 7.3.0.

uint8x16_t vcopyq_laneq_u8_inner(uint8x16_t a, const int b, uint8x16_t c, const int d)
{
    uint8x16_t res;
    __asm__ __volatile__(
    "ins %[dst].B[%[dlane]], %[src].B[%[slane]]    \n\t"
    :[dst] "=w"(res)
    :"0"(a), [dlane]"i"(b), [src]"w"(c), [slane]"i"(d)
    :);
    return res;
}

This function used to be an inline function that could be compiled and linked into an executable. But now we want to compile it into a dynamic library, so we removed its inline keyword. Now it fails to compile, with these diagnostics:

warning: asm operand 2 probably doesn't match constraints
warning: asm operand 4 probably doesn't match constraints
error: impossible constraint in 'asm'

I guess this error happens because the inline-assembly constraint "i" needs an "immediate integer operand", but the variables 'b' and 'd' are constants, aren't they?

I have an idea to make this function compile: use if-else to test the values of 'b' and 'd', and replace dlane/slane with immediate integer operands. But in our code uint8x16_t is a vector of 16 uint8_t values, so I would need to write 16x16 == 256 if-else branches, which is inefficient.

So my questions are:

  1. Why can this function be compiled and linked successfully into an executable when it is inline, but not compiled into a dynamic library without inline?
  2. Is there an efficient way to avoid using 256 if-else statements?
Peter Cordes
  • Update the question to specify the architecture this is for. I am looking at a reference manual for the ARM v7-A and ARM v7-R architectures, and it does not show any `ins` instruction, so your `arm` tag is insufficient to identify the architecture. – Eric Postpischil Sep 05 '22 at 14:06
  • Re question 1: When the function is called with literals for arguments, or other compile-time constant expressions, and it is inlined, the compiler can see the values and prepare immediate operands for them. When the function is not inlined, the compiler has only the function parameters to work with, and it cannot create immediate operands for variable parameters. – Eric Postpischil Sep 05 '22 at 14:07
  • Re question 2: An obvious solution would be to change the instruction to a form that does not require immediate operands or to replace it by other instructions that accomplish the effect or by C code. In order to do this, it is necessary to know what the instruction does. Hence you must specify the architecture or specify what the instruction does. – Eric Postpischil Sep 05 '22 at 14:08
  • Also, moving the function into a library will likely destroy all the performance gain that defining it as a single instruction and inlining it was intended to accomplish. – Eric Postpischil Sep 05 '22 at 14:09
  • @EricPostpischil: ARM SIMD instructions generally have mnemonics like `vxyz`, while the AArch64 version just uses `xyz`. I assume this is actually AArch64. – Peter Cordes Sep 05 '22 at 14:15

3 Answers

const means you can't modify the variable, not that it's a compile-time constant. That's only the case if the caller passes a constant, and you compile with optimization enabled so constant-propagation can get that value to the asm statement. Even C++ constexpr doesn't require a constant expression in most contexts, it only allows it, and guarantees that compile-time constant-propagation is possible.

A stand-alone version of this function can't exist, but you didn't make it static so the compiler has to create a non-inline definition that can get called from other compilation units, even if it inlines into every call-site in this file. But this is impossible, because const int b doesn't have a known value.

For example,

int foo(const int x){
   return x*37;
}

int bar(){
   return foo(2);
}

On Godbolt compiled for AArch64: notice that foo can't just return a constant, it needs to work with a run-time variable argument, whatever value it happens to be. Only in bar with optimization enabled can it inline and not need the value of x in a register, just return a constant. (Which it used as an immediate for mov).

foo(int):
        mov     w1, 37
        mul     w0, w0, w1
        ret
bar():
        mov     w0, 74
        ret

In a shared library, your function also has to be __attribute__((visibility("hidden"))) so it can actually inline, otherwise the possibility of symbol interposition means that the compiler can't assume that foo(123) is actually going to call int foo(int) defined in the same .c file.

(Or static inline.)


Is there an efficient way to avoid using 256 if-else statements?

Not sure what you're doing with your vector exactly, but if you don't have a shuffle that can work with runtime-variable lane numbers, storing to a 16-byte array can be the least bad option. But storing one byte and then reloading the whole vector will cause a store-forwarding stall, probably similar to the cost on x86 if not worse.
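As a rough illustration of the "store to a 16-byte array" idea, here is a minimal sketch. To keep it compilable on any GCC or Clang target it uses a GNU vector type `u8x16` as a stand-in for `uint8x16_t`; that typedef and the function name `copy_lane_via_array` are made up for this example. On AArch64 you would more likely use `vst1q_u8` / `vld1q_u8` on a real `uint8x16_t`.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for uint8x16_t (assumption: GCC/Clang vector extension). */
typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Copy byte lane `slane` of `c` into lane `dlane` of `a`.
   Both lane numbers may be ordinary run-time values. */
u8x16 copy_lane_via_array(u8x16 a, int dlane, u8x16 c, int slane)
{
    uint8_t dst[16], src[16];

    memcpy(dst, &a, sizeof dst);        /* spill both vectors to memory */
    memcpy(src, &c, sizeof src);
    dst[dlane & 15] = src[slane & 15];  /* mask keeps indices in 0..15 */
    memcpy(&a, dst, sizeof dst);        /* reload the modified vector */
    return a;
}
```

This works for any lane numbers, but it goes through memory, which is the store/reload cost mentioned above.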

Doing your algorithm efficiently with AArch64 SIMD instructions is a separate question, and you haven't given enough info to figure out anything about that. Ask a different question if you want help implementing some algorithm to avoid this in the first place, or an efficient runtime-variable byte insert using other shuffles.

Peter Cordes
  • ARM64's general shuffle instruction is `tbl/tbx` and it can use variable indices. But of course it is also more expensive than `ins`. – Nate Eldredge Sep 05 '22 at 14:30
  • Actually I guess there's a chicken-and-egg problem, because `tbl/tbx` needs the value `slane` in element `dlane` of the table vector, so again you need to insert a value at a non-constant index. Hmm. – Nate Eldredge Sep 05 '22 at 14:37
  • Thanks for your answer. Indeed, we used '__attribute__((visibility("hidden")))' to get inlining before, and now I know the answer to question 1. But if this function is inline, it compiles successfully but can't be found in the shared library. And I'm still not sure about your suggestion to "store to a 16-byte array". – SnowDance1997 Sep 05 '22 at 15:33
  • @冰煌雪舞: This function only makes sense as an inline function in a header; you wouldn't want to put it in a shared library and call it from the main program, only from inside the shared library. Function call overhead makes that pointless even if it didn't need compile-time constant arguments. It's one instruction after inlining, the `bl` alone to call it would take as much code size, and instructions to set up the args would make it a disaster. – Peter Cordes Sep 05 '22 at 15:38
  • @PeterCordes We provided source files to our users in the past with those 'inline' functions, and it was all fine. But now my boss wants to provide just a shared library, so those functions can't be inline. Anyway, thanks for your solution, I'll talk with my boss about it. – SnowDance1997 Sep 05 '22 at 15:54
  • @冰煌雪舞: A user would be crazy to want to call this as a non-inline function; it makes no sense to provide something like this when ARM64 intrinsics already exist and do inline. Or portable SIMD-library header emulations like SIMDe (https://github.com/simd-everywhere/simde which implements "foreign" intrinsics) or SSE2NEON, or a wrapper library like Vc. IDK if the AArch64 calling convention has any call-preserved vector registers, but if not, a non-inline function call would make the compiler spill/reload all vector variables around a non-inline call. – Peter Cordes Sep 05 '22 at 16:02
  • @PeterCordes: The low 64 bits of v8-v15 are call-preserved, and that's it. I think this is for the benefit of scalar floating-point code, since these are also known as the scalar registers d8-d15. But none of the v registers are fully preserved, so yes, it'll be prohibitive in any performance-critical SIMD code. – Nate Eldredge Sep 06 '22 at 13:14
  • @NateEldredge: Cool, that sounds like a good design. It's common for scalar FP math code to call math library functions, so having some call-preserved regs for scalar FP values is a good design. Unlike x86-64 System V which unfortunately doesn't even have any non-volatile scalar FP regs. – Peter Cordes Sep 06 '22 at 13:23

Constraint "i" means a number. A specific number. It means you want the compiler to emit an instruction like this:

ins v0.B[2], v1.B[3]

(pardon me if my AArch64 assembly syntax isn't quite right) where v0 is the register containing res, v1 is the register containing c, 2 is the value of b (not the number of the register containing b) and 3 is the value of d (not the number of the register containing d).

That is, if you call

vcopyq_laneq_u8_inner(something, 2, something, 3)

the instruction in the function is

ins v0.B[2], v1.B[3]

but if you call

vcopyq_laneq_u8_inner(something, 1, something, 2)

the instruction in the function is

ins v0.B[1], v1.B[2]

The compiler has to know which numbers b and d are, so it knows which instruction you want. If the function is inlined, and the parameters b and d are constant numbers, it's smart enough to do that. However, if you write this function in a way where it's not inlined, the compiler has to make an actual function that works no matter what number the b and d parameters are, and how can it possibly do that if you want it to use a different instruction depending on what they are?

The only way it could do that is to write all 256 possible instructions and switch between them depending on the parameters. However, the compiler won't do that automatically - you'd need to do it yourself. For one thing, the compiler doesn't know that b and d can only go from 0 up to 15.
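If you did go the "all 256 cases" route, you wouldn't have to type them by hand; the preprocessor can generate them. The following is only a sketch of that idea: `u8x16` is a GNU vector type standing in for `uint8x16_t`, and `copy_lane_switch` and the macro names are invented for this example. Inside each generated case both lane numbers are literal constants, so on AArch64 that is the spot where an asm statement with "i" constraints could go instead of the plain C assignment used here.

```c
#include <stdint.h>

/* Stand-in for uint8x16_t (assumption: GCC/Clang vector extension). */
typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Innermost case: D and S are literal constants at this point.
   The macros refer to the variables a, c and slane of the function below. */
#define SRC_CASE(D, S) case S: a[D] = c[S]; break;

/* Inner switch over the 16 possible source lanes for a fixed destination D. */
#define SRC_SWITCH(D)                                                    \
    switch (slane & 15) {                                                \
        SRC_CASE(D, 0)  SRC_CASE(D, 1)  SRC_CASE(D, 2)  SRC_CASE(D, 3)   \
        SRC_CASE(D, 4)  SRC_CASE(D, 5)  SRC_CASE(D, 6)  SRC_CASE(D, 7)   \
        SRC_CASE(D, 8)  SRC_CASE(D, 9)  SRC_CASE(D, 10) SRC_CASE(D, 11)  \
        SRC_CASE(D, 12) SRC_CASE(D, 13) SRC_CASE(D, 14) SRC_CASE(D, 15)  \
    }

/* One outer case per destination lane. */
#define DST_CASE(D) case D: SRC_SWITCH(D) break;

/* Run-time lane numbers are dispatched to one of 16 x 16 = 256 cases. */
u8x16 copy_lane_switch(u8x16 a, int dlane, u8x16 c, int slane)
{
    switch (dlane & 15) {
        DST_CASE(0)  DST_CASE(1)  DST_CASE(2)  DST_CASE(3)
        DST_CASE(4)  DST_CASE(5)  DST_CASE(6)  DST_CASE(7)
        DST_CASE(8)  DST_CASE(9)  DST_CASE(10) DST_CASE(11)
        DST_CASE(12) DST_CASE(13) DST_CASE(14) DST_CASE(15)
    }
    return a;
}
```

This only saves typing. The generated code still contains a two-level branch in front of every lane copy, so it is unlikely to be fast.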

You should consider either not making this a library function (it's one instruction - doesn't doing a call into a library add overhead?) or else using different instructions where the lane number can be from a register. The instruction ins copies one vector element to another. I'm not familiar with ARM vector instructions, but there should be some instructions to rearrange or select items in a vector according to a number stored in a register.

user253751
  • Peter Cordes's answer is better, IMO. – user253751 Sep 05 '22 at 14:25
  • There's an extra set of brackets, the desired output asm would look like `ins v0.B[2], v1.B[3]`. – Nate Eldredge Sep 05 '22 at 14:27
  • Thanks for your answer, now I understand question 1. What I'm doing is translating AVX instructions to NEON instructions, so users can use the C interface to write efficient programs. My idea is the same as yours, writing 256 if-else statements, but it's not efficient. – SnowDance1997 Sep 05 '22 at 15:22
  • @冰煌雪舞 you said that if you make it inline, the compiler will just generate the correct instruction. So, do that? – user253751 Sep 05 '22 at 15:24
  • @user253751 Yes, if I make it 'inline', the compiler can generate the shared library. But you know, an inline function will not be a symbol in the shared library. – SnowDance1997 Sep 05 '22 at 15:39
  • @冰煌雪舞 Everything in the shared library must be compiled, that means the compiler must decide which instructions to use. Instead of making it a symbol in the shared library, define it inline in the header file. It doesn't need to be a symbol in the shared library. – user253751 Sep 05 '22 at 15:40
  • @user253751 That may be a solution. In the past we provided source files to users, but now we want to provide just a shared library, so we ran into this problem. Anyway, thanks for your answer. – SnowDance1997 Sep 05 '22 at 15:49

But now we want to compile this function into a dynamic library, so we removed its inline keyword. But it cannot compile successfully, and error info is:
warning: asm operand 2 probably doesn't match constraints
warning: asm operand 4 probably doesn't match constraints
error: impossible constraint in 'asm'
I guess this error happens because the inline-assembly constraint "i" needs an "immediate integer operand"

In GCC, constraint "i" means "immediate operand", which is a value that is known at link-time or earlier, and that is an integer or an address. For example, the address of a variable in static storage is known at link time, and you can use it just like a known value (provided the assembler supports a RELOC for it, which is beyond GCC).

but the var 'b' and 'd' is constant-var, isn't it?

const in C basically means read-only, which does not imply the value is known at link-time or earlier.

If that function was inline, and the context (hosting function and compiler optimization) is such that the values turn out to be known, then the constraints can be satisfied.

If the context is such that "i" cannot be satisfied — which is the case for a library function where you don't know the context at compile-time — then gcc will throw an error.

What you can do

One way is to supply the function as static inline in the header that accompanies the library (*.so, *.a, etc.) and describes the library interfaces and public functions. In that case the user is responsible to only use the function in appropriate contexts (or get that error message thrown at them).

Second way is to re-write the inline assembly to use instructions which can handle operands that are only known at run-time, e.g. register operands. This is usually less efficient and generates higher register pressure. In the case of a library function, you will add call-overhead just to issue one instruction.

Third way is to combine both approaches and supply the function as static inline in the library header, but write it like

static inline __attribute__((__always_inline__))
uint8x16_t vcopyq_laneq_u8_inner (uint8x16_t a, int b, uint8x16_t c, int d)
{
    uint8x16_t res;
    if (__builtin_constant_p (b) && __builtin_constant_p (d))
    {
        // Both lane numbers are compile-time constants, so "i" can be satisfied.
        __asm__ __volatile__(
              "ins %[dst].B[%[dlane]], %[src].B[%[slane]]"
              : [dst] "=w" (res)
              : "0" (a), [dlane] "i" (b), [src] "w" (c), [slane] "i" (d));
    }
    else
    {
        // Placeholder: use code and constraints that can handle non-"i" b and d,
        // and assign the result to res.
    }
    return res;
}

This allows the compiler to use the optimal code when b and d satisfy "i", but it makes the function so generic that it will also work in a broader context.

Apart from that, nothing about that instruction seems to warrant volatile. If, for example, the return value is unused, the instruction is not needed, right? In that case, remove the volatile, which gives the compiler more freedom to schedule the inline asm.

emacs drives me nuts