Concrete example of incorrect behavior of an early-clobber affecting a memory operand's addressing mode in GCC inline asm?

Question

The following is excerpted from the GCC manual's Extended Asm documentation, which covers embedding assembly instructions in C using the asm keyword:

The same problem can occur if one output parameter (a) allows a register constraint and another output parameter (b) allows a memory constraint. The code generated by GCC to access the memory address in b can contain registers which might be shared by a, and GCC considers those registers to be inputs to the asm. As above, GCC assumes that such input registers are consumed before any outputs are written. This assumption may result in incorrect behavior if the asm statement writes to a before using b. Combining the ‘&’ modifier with the register constraint on a ensures that modifying a does not affect the address referenced by b. Otherwise, the location of b is undefined if a is modified before using b.

The sentence beginning "This assumption may result in incorrect behavior" says that the asm statement can misbehave if it writes to a before using b.

I cannot figure out how such "incorrect behavior" could occur, so I would like a concrete asm example that demonstrates it, to help me understand this paragraph properly.

I can see how a problem might arise when two such asm statements run in parallel, but the paragraph above does not mention a multiprocessing scenario.

Assuming a single CPU with a single core, can you show asm code that produces such incorrect behavior, that is, code where modifying a affects the address referenced by b so that the location of b is undefined?

The only assembly language I am familiar with is Intel x86 assembly, so please make the example target that platform.

The irony of the situation is that many people come asking on SO because their code is broken :D — Jester, May 22 '21 at 00:16
@TedLyngmo: GCC’s inline assembly feature is conforming C code, as defined by the C standard, and is, in a number of situations, useful while programming in C. — Eric Postpischil, May 22 '21 at 00:40
@EricPostpischil I see how mixing languages can work. I just think C is unnecessary here. Why is a wrapper language in it? Your answer is probably fine. — Ted Lyngmo, May 22 '21 at 00:50
BTW, every core (and every software thread via context switches) has its own *private* registers. Multi-threading doesn't work by randomly mixing instructions from different threads operating on the same architectural state. There are only a very few architectural registers (16 or 32 in common modern ISAs), and code (hand-written or compiler-generated) relies on them keeping the values you put in them. IDK if you mentioned multi-threading because you picture that's how it normally works, or you're just inventing hypothetical weird things that you could imagine causing breakage. — Peter Cordes, May 22 '21 at 06:34

Jester · Accepted Answer · 2021-05-22T01:20:02.483

Consider the following example:

extern int* foo();
int bar()
{
    int r;

    __asm__(
        "mov $0, %0 \n\t"
        "add %1, %0"
    : "=r" (r) : "m" (*foo()));

    return r;
}

The usual calling convention puts return values into the eax register. As such, there is a good chance the compiler decides to use eax throughout, to avoid unnecessary copying. The generated assembly may look like:

        subl    $12, %esp
        call    foo
        mov $0, %eax
        add (%eax), %eax
        addl    $12, %esp
        ret

Notice that mov $0, %eax zeroes eax before the next instruction uses it as the address of the input operand, so the add dereferences address 0 and this code will crash. With an early-clobber modifier ("=&r" instead of "=r" on the output), you force the compiler to pick a different register for the output. In my case, the resulting code was:

        subl    $12, %esp
        call    foo
        mov $0, %edx
        add (%eax), %edx
        addl    $12, %esp
        movl    %edx, %eax
        ret

The compiler could have instead moved the result of foo() into edx (or any other free register), like this:

        subl    $12, %esp
        call    foo
        mov     %eax, %edx
        mov $0, %eax
        add (%edx), %eax
        addl    $12, %esp
        ret

This example used the memory constraint for an input argument, but the concept applies equally to outputs.
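For reference, the early-clobber version of the source might look like the sketch below. It assumes an x86 target and a GCC-compatible compiler; the static variable value and the body given to foo() are hypothetical stand-ins added only so the snippet links on its own, and the plain C fallback exists only so it still builds on other targets.

```c
/* Hypothetical stand-in for the answer's external foo(): it returns the
 * address that the asm statement reads through its "m" operand. */
static int value = 42;

int *foo(void)
{
    return &value;
}

int bar(void)
{
    int r;

#if defined(__i386__) || defined(__x86_64__)
    __asm__(
        "movl $0, %0 \n\t"   /* writes the output first ...              */
        "addl %1, %0"        /* ... then reads the memory input operand  */
    : "=&r" (r)              /* '&': r must not share a register with    */
                             /* anything used to address operand %1      */
    : "m" (*foo())
    : "cc");
#else
    /* Assumption: non-x86 fallback with the same result, no inline asm. */
    r = 0 + *foo();
#endif

    return r;
}
```

The only change from the broken version is "=r" becoming "=&r"; the asm template itself is unchanged, because the fix is about telling the compiler the truth about when the output is written.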

Indeed, https://godbolt.org/z/x4WMY7dns shows exactly the breakage you're talking about, and that an `"=&r"` early-clobber output avoids it. (With GCC11 for x86-64) — Peter Cordes, May 22 '21 at 02:31
Awesome example. Not only does it demonstrate the incorrect behavior, it also illustrates “and GCC considers those registers to be inputs to the asm. As above, GCC assumes that such input registers are consumed before any outputs are written.” %eax is an input to the asm, so GCC assumes %eax is consumed already and consequently uses it for output to generate "mov $0, %eax". This write causes incorrect behavior because the aforementioned assumption does not hold: the input %eax is still being consumed when the output (placed in the same register as an optimization) is written. GCC's assumption is ... — zzzhhh, May 22 '21 at 23:13
reasonable, so it should be respected by programmer to allocate other registers for output by using "=&r" as the first correct code, or to manually save input %eax to somewhere else as the second correct code (a way of consumption). — zzzhhh, May 22 '21 at 23:18

Eric Postpischil · Answer 2 · 2021-05-22T09:41:54.790

Given the code below, Apple Clang 11 with -O3 uses (%rax) for a and %eax for b.

void foo(int *a)
{
    __asm__(
            "nop    # a is %[a].\n"
            "nop    # b is %[b].\n"
            "nop    # c is %[c].\n"
            "nop    # d is %[d].\n"
            "nop    # e is %[e].\n"
            "nop    # f is %[f].\n"
            "nop    # g is %[g].\n"
            "nop    # h is %[h].\n"
            "nop    # i is %[i].\n"
            "nop    # j is %[j].\n"
            "nop    # k is %[k].\n"
            "nop    # l is %[l].\n"
            "nop    # m is %[m].\n"
            "nop    # n is %[n].\n"
            "nop    # o is %[o].\n"
        :
            [a] "=m" (a[ 0]),
            [b] "=r" (a[ 1]),
            [c] "=r" (a[ 2]),
            [d] "=r" (a[ 3]),
            [e] "=r" (a[ 4]),
            [f] "=r" (a[ 5]),
            [g] "=r" (a[ 6]),
            [h] "=r" (a[ 7]),
            [i] "=r" (a[ 8]),
            [j] "=r" (a[ 9]),
            [k] "=r" (a[10]),
            [l] "=r" (a[11]),
            [m] "=r" (a[12]),
            [n] "=r" (a[13]),
            [o] "=r" (a[14])
        );
}

So, if the nop instructions and comments were replaced with actual instructions that wrote to %[b] before %[a], they would destroy the address needed for %[a].
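As a rough illustration of that scenario with real instructions, the sketch below uses one memory output and one register output and deliberately writes the register first. The function name store_pair and the constants are hypothetical, and an x86 target with a GCC-compatible compiler is assumed (with a plain C fallback elsewhere). The '&' on the register output is what keeps the compiler from choosing the register that holds the address of %[a].

```c
/* Hypothetical example: writes 2 to p[0] through a memory operand and
 * 1 to p[1] through a register operand, register first. */
void store_pair(int *p)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__(
        "movl $1, %[b]\n\t"   /* register output is written first        */
        "movl $2, %[a]"       /* memory operand's address is used after  */
    : [a] "=m"  (p[0]),
      [b] "=&r" (p[1]));      /* early-clobber: never shares a register   */
                              /* with the address used for %[a]           */
#else
    /* Assumption: non-x86 fallback with the same visible effect. */
    p[1] = 1;
    p[0] = 2;
#endif
}
```

With a plain "=r" on [b], the compiler would be allowed to place [b] in the register that addresses [a], and the second instruction would then store through a clobbered pointer.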

edited May 22 '21 at 09:41

answered May 22 '21 at 00:33

I was wondering why GCC and clang are so reluctant to reuse a reg, preferring even to save/restore call-preserved regs like RBX. Turns out it's just because it needs the array pointer `int *a` after the asm statement, to get the outputs from their registers into `a[...]`. Only if you fully max out the number of reg outputs you can have will gcc or clang choose EDI as an output https://godbolt.org/z/9eeWfezzo. (With clang doing a really horrible job in the code after, spilling them *all* to the stack instead of just 1 to make room to reload the pointer, then copying 4B at a time.) – Peter Cordes May 22 '21 at 02:22
`bar()` in the same Godbolt link just returns the sum of 2 of the output, not storing to `a[]` after the asm statement; then clang will happily choose EDI as an output before picking any of R8D..R15D, just 5 register outputs. (GCC still picks R8D before that, though.) – Peter Cordes May 22 '21 at 02:25
And BTW, you used `[h]` twice. GCC, and mainline clang, [both error on that](https://godbolt.org/z/YvW69zzPe); I assume Apple Clang does to, and that was a copy-paste glitch. Also, for ease of Godbolt use, I often do `nop # comment` so the line isn't filtered out. (GCC does a pure text substitution so it doesn't even have to be valid asm, unlike clang's built-in assembler) – Peter Cordes May 22 '21 at 02:28
@PeterCordes: Re “I assume Apple Clang does [too]”: It does not! – Eric Postpischil May 22 '21 at 09:42
No error diagnostic in Clang until v9. That's quite surprising. Time to update your Mac! :-) – Cody Gray - on strike May 22 '21 at 10:01
@CodyGray: I wish I could. I am dependent on 32-bit Quicken 2007. macOS past 10.14.6 does not support 32-bit, and later versions of Quicken do not have features I depend on, and Quicken alternatives have not when I surveyed them in the past. And I am at the limit of Apple’s developer tools for macOS 10.14.6. I am working on setting up 10.14.6 in VirtualBox so I can keep Quicken while I update the real system to a later OS. – Eric Postpischil May 22 '21 at 10:06
