Using FPU with C inline assembly

Question

I wrote a vector structure like this:

struct vector {
    float x1, x2, x3, x4;
};

Then I created a function which does some operations with inline assembly using the vector:

struct vector *adding(const struct vector v1[], const struct vector v2[], int size) {
    struct vector vec[size];
    int i;
    
    for(i = 0; i < size; i++) {
        asm(
            "FLDL %4 \n" //v1.x1
            "FADDL %8 \n" //v2.x1
            "FSTL %0 \n"
            
            "FLDL %5 \n" //v1.x2
            "FADDL %9 \n" //v2.x2
            "FSTL %1 \n"
            
            "FLDL %6 \n" //v1.x3
            "FADDL %10 \n" //v2.x3
            "FSTL %2 \n"
            
            "FLDL %7 \n" //v1.x4
            "FADDL %11 \n" //v2.x4
            "FSTL %3 \n"
            
            :"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4)     //wyjscie
            :"g"(&v1[i].x1), "g"(&v1[i].x2), "g"(&v1[i].x3), "g"(&v1[i].x4), "g"(&v2[i].x1), "g"(&v2[i].x2), "g"(&v2[i].x3), "g"(&v2[i].x4) //wejscie
            :
        );
    }

    return vec;
}

Everything looks OK, but when I try to compile this with GCC I get these errors:

Error: Operand type mismatch for 'fadd'

Error: Invalid instruction suffix for 'fld'

On OS/X in XCode everything working correctly. What is wrong with this code?

Using `g` constraint is a very bad idea. Look at the generated asm code to see what is invalid but I guess it will be due to them. — Jester, May 29 '16 at 12:00
Not the answer, but if this is meant to be an optimisation it probably isn't. Also you have undefined behaviour in the way you're returning the result. — Flexo, May 29 '16 at 12:03
Any reason you don't use SSE/SIMD? And you shouldn't return the pointer to a local (stack based) structure. All the input operands that are "g" should be "m". You can't pass a general purpose registers (and immediates) to FLD, FADD, and FST. — Michael Petch, May 29 '16 at 12:10
I have a code with SSE @MichaelPetch, now I am doing without this and counting times, which way is faster (for school) but I deleted these lines because this is not the problem. I tried use "m" but I got a lot of errors like this: "memory input 6 is not directly addressable" — demoo, May 29 '16 at 12:17

Michael Petch · Accepted Answer · 2016-05-30T14:48:30.643

Coding Issues

I'm not looking at making this efficient (I'd be using SSE/SIMD if the processor supports it). Since this part of the assignment is to use the FPU stack then here are some concerns I have:

Your function declares a local stack based variable:

struct vector vec[size];

The problem is that your function returns a vector * and you do this:

return vec;

This is very bad. The stack based variable could get clobbered after the function returns and before the data gets consumed by the caller. One alternative is to allocate memory on the heap rather than the stack. You can replace struct vector vec[size]; with:

struct vector *vec = malloc(sizeof(struct vector)*size);

This would allocate enough space for an array of size number of vector. The person who calls your function would have to use free to deallocate the memory from the heap when finished.

Your vector structure uses float, not double. The instructions FLDL, FADDL, FSTL all operate on double (64-bit floats). Each of these instructions will load and store 64-bits when used with a memory operand. This would lead to the wrong values being loaded/stored to/from the FPU stack. You should be using FLDS, FADDS, FSTS to operate on 32-bit floats.

In the assembler templates you use the g constraint on the inputs. This means the compiler is free to use any general purpose registers, a memory operand, or an immediate value. FLDS, FADDS, FSTS do not take immediate values or general purpose registers (non-FPU registers) so if the compiler attempts to do so it will likely produce errors similar to Error: Operand type mismatch for xxxx.

Since these instructions understand a memory reference use m instead of g constraint. You will need to remove the & (ampersands) from the input operands since m implies that it will be dealing with the memory address of a variable/C expression.

You don't pop the values off the FPU stack when finished. FST with a single operand copies the value at the top of the stack to the destination. The value on the stack remains. You should store it and pop it off with an FSTP instruction. You want the FPU stack to be empty when your assembler template ends. The FPU stack is very limited with only 8 slots available. If the FPU stack is not clear when the template completes then you run the risk of the FPU stack overflowing on subsequent calls. Since you leave 4 values on the stack on each call, calling the function adding a third time should fail.

To simplify the code a bit I'd recommend using a typedef to define vector. Define your structure this way:

typedef struct {
    float x1, x2, x3, x4;
} vector;

All references to struct vector can simply become vector.

With all of these things in mind your code could look something like this:

typedef struct {
    float x1, x2, x3, x4;
} vector;

vector *adding(const vector v1[], const vector v2[], int size) {
    vector *vec = malloc(sizeof(vector)*size);
    int i;

    for(i = 0; i < size; i++) {
        __asm__(
            "FLDS %4 \n" //v1.x1
            "FADDS %8 \n" //v2.x1
            "FSTPS %0 \n"

            "FLDS %5 \n" //v1.x2
            "FADDS %9 \n" //v2.x2
            "FSTPS %1 \n"

            "FLDS %6 \n" //v1->x3
            "FADDS %10 \n" //v2->x3
            "FSTPS %2 \n"

            "FLDS %7 \n" //v1->x4
            "FADDS %11 \n" //v2->x4
            "FSTPS %3 \n"

            :"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4)
            :"m"(v1[i].x1), "m"(v1[i].x2), "m"(v1[i].x3), "m"(v1[i].x4),
             "m"(v2[i].x1), "m"(v2[i].x2), "m"(v2[i].x3), "m"(v2[i].x4)
            :
        );
    }

    return vec;
}

Alternative Solutions

I don't know the parameters of the assignment, but if it were to make you use GCC extended assembler templates to manually do an operation on the vector with an FPU instruction then you could define the vector with an array of 4 float. Use a nested loop to process each element of the vector independently passing each through to the assembler template to be added together.

Define the vector as:

typedef struct {
    float x[4];
} vector;

The function as:

vector *adding(const vector v1[], const vector v2[], int size) {
    int i, e;
    vector *vec = malloc(sizeof(vector)*size);

    for(i = 0; i < size; i++)
        for (e = 0; e < 4; e++)  {
            __asm__(
                "FADDPS\n"
                :"=t"(vec[i].x[e])
                :"0"(v1[i].x[e]), "u"(v2[i].x[e])
        );
    }

    return vec;
}

This uses the i386 machine constraints t and u on the operands. Rather than passing a memory address we allow GCC to pass them via the top two slots on the FPU stack. t and u are defined as:

t
Top of 80387 floating-point stack (%st(0)).

u
Second from top of 80387 floating-point stack (%st(1)).

The no operand form of FADDP does this:

Add ST(0) to ST(1), store result in ST(1), and pop the register stack

We pass the two values to add at the top of the stack and perform an operation leaving ONLY the result in ST(0). We can then get the assembler template to copy the value on the top of the stack and pop it off automatically for us.

We can use an output operand of =t to specify the value we want moved is from the top of the FPU stack. =t will also pop (if needed) the value off the top of FPU stack for us. We can also use the top of the stack as an input value too! If the output operand is %0 we can reference it as an input operand with the constraint 0 (which means use the same constraint as operand 0). The second vector value will use the u constraint so it is passed as the second FPU stack element (ST(1))

A slight improvement that could potentially allow GCC to optimize the code it generates would be to use the % modifier on the first input operand. The % modifier is documented as:

Declares the instruction to be commutative for this operand and the following operand. This means that the compiler may interchange the two operands if that is the cheapest way to make all operands fit the constraints. ‘%’ applies to all alternatives and must appear as the first character in the constraint. Only read-only operands can use ‘%’.

Because x+y and y+x yield the same result we can tell the compiler that it can swap the operand marked with % with the one defined immediately after in the template. "0"(v1[i].x[e]) could be changed to "%0"(v1[i].x[e])

Disadvantages: We've reduced the code in the assembler template to a single instruction, and we've used the template to do most of the work setting things up and tearing it down. The problem is that if the vectors are likely going to be memory bound then we transfer between FPU registers and memory and back more times than we may like it to. The code generated may not be very efficient as we can see in this Godbolt output.

We can force memory usage by applying the idea in your original code to the template. This code may yield more reasonable results:

vector *adding(const vector v1[], const vector v2[], int size) {
    int i, e;
    vector *vec = malloc(sizeof(vector)*size);

    for(i = 0; i < size; i++)
        for (e = 0; e < 4; e++)  {
            __asm__(
                "FADDS %2\n"
            :"=&t"(vec[i].x[e])
            :"0"(v1[i].x[e]), "m"(v2[i].x[e])
        );
    }

    return vec;
}

Note: I've removed the % modifier in this case. In theory it should work, but GCC seems to emit less efficient code (CLANG seems okay) when targeting x86-64. I'm unsure if it is a bug; whether my understanding is lacking in how this operator should work; or there is an optimization being done I don't understand. Until I look at it closer I am leaving it off to generate the code I would expect to see.

In the last example we are forcing the FADDS instruction to operate on a memory operand. The Godbolt output is considerably cleaner, with the loop itself looking like:

.L3:
        flds    (%rdi)  # MEM[base: _51, offset: 0B]
        addq    $16, %rdi       #, ivtmp.6
        addq    $16, %rcx       #, ivtmp.8
        FADDS (%rsi)    # _31->x

        fstps   -16(%rcx)     # _28->x
        addq    $16, %rsi       #, ivtmp.9
        flds    -12(%rdi)       # MEM[base: _51, offset: 4B]
        FADDS -12(%rsi) # _31->x

        fstps   -12(%rcx)     # _28->x
        flds    -8(%rdi)        # MEM[base: _51, offset: 8B]
        FADDS -8(%rsi)  # _31->x

        fstps   -8(%rcx)      # _28->x
        flds    -4(%rdi)        # MEM[base: _51, offset: 12B]
        FADDS -4(%rsi)  # _31->x

        fstps   -4(%rcx)      # _28->x
        cmpq    %rdi, %rdx      # ivtmp.6, D.2922
        jne     .L3       #,

In this final example GCC unwound the inner loop and only the outer loop remains. The code generated by the compiler is similar in nature to what was produced by hand in the original question's assembler template.

Instead of malloc, I would have suggested having the caller pass a `dst` pointer. That allows reuse of the same buffer across multiple calls, for example. — Peter Cordes, May 30 '16 at 00:48
@PeterCordes I considered that, and your suggestion is reasonable. In this case I wasn't going to change the prototype since I was unsure if they are provided as part of the assignment. — Michael Petch, May 30 '16 at 00:49
I think with the right contraints, you could get gcc to produce the same optimal asm, but still only having a single instruction in the asm statement. i.e. ask for one input operand to be on the top of the FPU stack, instead of in memory. Then in a context where both inputs are the same, gcc can emit `fadds %st(0)` (unless the `s` operand-size suffix is a syntax error in that context! You might need intel-syntax mode to get gcc to supply the operand-size on the operand). Also, I think you have `%1` and `%2` reversed in your final asm statement. The `flds` should use the `"m"`-only operand. — Peter Cordes, May 30 '16 at 00:54
@PeterCordes And yes, the syntax error was in fact the reason I went down that route. I wasn't going to muddy the waters by introducing separate Intel syntax, although I suspect that would work. — Michael Petch, May 30 '16 at 01:00
@PeterCordes I stand corrected the input operands should have both been `m` since adding `s` to the instruction as a suffix doesn't allow the FPU register to be utilized (syntax error) with AT&T syntax.I'll modify the answer to reflect that. In theory I could have left off the suffix and let the compiler default to 32-bit, but felt that it might seem inconsistent, and would cause an issue if they changed to using double (runtime failure) — Michael Petch, May 30 '16 at 01:07
I have a preference for the suffix being present for clarity.Give the discussion I gave I'd rather see _S_ or _L_ on the end as it is much clearer about the size being operated on. Leaving it off an allowing the compiler to silently pick seems more problematic if one goes from floats to doubles. One might incorrectly assume the assembler template would somehow properly determine whether the operands were floats or doubles. — Michael Petch, May 30 '16 at 01:22
I agree it doesn't make sense to talk about `-masm=intel` to get the compiler to substitute `dword ptr ...` for you, and that your code now is the best choice for this answer. I also don't like letting the assembler pick a default memory operand size for FP instructions. If I was ever going to do that, to allow a reg-or-memory operand in a case where it would actually get used both ways in an existing codebase, I'd have to put a big comment explaining the risks and the reason why it was worth it. — Peter Cordes, May 30 '16 at 01:24
`typedef struct _vector` can be just `typedef struct { ... } vector;`. Or preferably some different name, because `std::vector` is ubiquitous in C++. It's like using `printf` as a function name in some other language for something that doesn't work anything like `printf(3)`. — Peter Cordes, May 30 '16 at 01:29
@PeterCordes Originally when I did this I was using `%0` constraint taking *potential* advantage of the commutative property of add. The code generated looked unusual and I went down the path of doing the load and fadd in the template. It wasn't until I removed the `%` and went with `0` that I saw the code generation I was expecting. Although my last edits suggest I may not know what I am doing, I'm nearly positive there is an optimizer bug with x86-64 and FPU code. _CLANG_ doesn't seem to give this behavior. The code generated by _GCC_ seems to regress back to code generated in GCC 4.7-4.9.0 — Michael Petch, May 30 '16 at 14:08
One can see the difference if you take the last code alternative in my answer and change `:"0"(v1[i].x[e]), "m"(v2[i].x[e])` to `:"%0"(v1[i].x[e]), "m"(v2[i].x[e])` — Michael Petch, May 30 '16 at 14:15
Interesting observation. Clearly code-generation for x87 inline asm with `-mfpmath=sse` isn't a priority for gcc developers. Using `"%0"` to get commutative inputs apparently lets gcc "think too hard" about this, and it ends up copying them to / from a local on the stack using `movss`. [`-mfpmath=387` makes good asm with or without `"%0"`](https://godbolt.org/g/KUILM4). I'm not planning to file a bug report for this, since x87 is obsolete and gcc devs don't need to be spending time on inline asm for it, as long as the generated code actually runs correctly. — Peter Cordes, May 30 '16 at 14:59
@PeterCordes : Oh, it had crossed my mind to file a bug when I made the observation, but as it stands the value in it is very limited since what is being done here is in most cases a fringe case that you'd probably only see happen in academia (at least with x86-64 code). My comment about it in the answer only said it produced *inefficient* code, but the code itself does in fact work as expected albeit in a round about way. — Michael Petch, May 30 '16 at 15:05
Yup, I didn't think there was any correctness problem. I could imagine using x87 inline asm for code-size reasons, e.g. for `fsincos` instead of a library with SSE insns, even in "modern" code, but maybe only for a memory-constrained embedded system. — Peter Cordes, May 30 '16 at 15:10

Using FPU with C inline assembly

1 Answers1

Coding Issues

Alternative Solutions

Linked