
I am trying to understand the implications of the System V AMD64 ABI's calling convention by looking at the following example:

struct Vec3{
    double x, y, z;
};

struct Vec3 do_something(void);

void use(struct Vec3 * out){
    *out = do_something();
}

A Vec3 variable is of class MEMORY, and thus the caller (use) must allocate space for the returned value and pass it as a hidden pointer to the callee (i.e. do_something). This is what we see in the resulting assembly (on godbolt, compiled with -O2):

use:
        pushq   %rbx
        movq    %rdi, %rbx           # remember out
        subq    $32, %rsp            # memory for returned object
        movq    %rsp, %rdi           # hidden pointer to %rdi
        call    do_something
        movdqu  (%rsp), %xmm0        # copy memory to out
        movq    16(%rsp), %rax
        movups  %xmm0, (%rbx)
        movq    %rax, 16(%rbx)
        addq    $32, %rsp            # unwind/restore
        popq    %rbx
        ret

I understand that an alias of the pointer out (e.g. a global variable) could be used in do_something, and thus out cannot be passed as the hidden pointer to do_something: if it were, *out would be changed inside do_something and not when do_something returns, so some calculations might become faulty. For example, this version of do_something would return faulty results:

struct Vec3 global; //initialized somewhere
struct Vec3 do_something(void){
   struct Vec3 res;
   res.x = 2*global.x; 
   res.y = global.y+global.x; 
   res.z = 0; 
   return res;
}

If out were an alias for the global variable global and were passed as the hidden pointer in %rdi, then res would also be an alias of global, because the compiler would use the memory pointed to by the hidden pointer directly (a kind of RVO in C) without creating a temporary object and copying it on return. Then res.y would be 2*x+y (where x, y are the old values of global) and not x+y as for any other hidden pointer.

It was suggested to me, that using restrict should solve the problem, i.e.

void use(struct Vec3 *restrict out){
    *out = do_something();
}

because now the compiler knows that there are no aliases of out which could be used in do_something, so the assembly could be as simple as this:

use:
    jmp     do_something # %rdi is now the hidden pointer

However, this is not the case for either gcc or clang: the assembly stays unchanged (see on godbolt).

What prevents the usage of out as hidden pointer?


NB: The desired (or very similar) behavior would be achieved for a slightly different function-signature:

struct Vec3 use_v2(){
    return do_something();
}

which results in (see on godbolt):

use_v2:
    pushq   %r12
    movq    %rdi, %r12
    call    do_something
    movq    %r12, %rax
    popq    %r12
    ret
Peter Cordes
ead
  • I don't understand what a "hidden pointer" is. Is it a synonym for "opaque pointer"? I don't see a problem with this code, granted that "out" must be allocated before "use" is called. – Tom's Aug 06 '19 at 13:44
  • @Tom's please see https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI. I'm asking why the resulting assembler isn't as optimal as I would like. – ead Aug 06 '19 at 13:47
  • Thanks! I didn't understand your question before. That "hidden" parameter is handy, indeed. I always believed that the "this" argument in C++ was added by the C++ compiler (like the mangling is done), but it seems that they "simply" use the "hidden" parameter ... Sorry, I don't have a clue for your question :/ – Tom's Aug 06 '19 at 13:54
  • What is the point of writing code like `Vec3 do_something();` and forcing the compiler to use a hidden pointer? You need to **explicitly** write `void do_something(Vec3* )`, because this is the only way (returning `Vec3` is impossible). So if you want optimized binary code, you must write optimized source code yourself to begin with – RbMm Aug 07 '19 at 11:52
  • @RbMm the whole idea of compiled language is to allow compilers to optimize – M.M Aug 07 '19 at 23:19
  • @M.M - but for what try return value which is impossible return ? simply need understand that `Vec3` impossible return (by value) and write `void do_something(Vec3* )` (anyway this is will be real signature of function) – RbMm Aug 07 '19 at 23:48

3 Answers


A function is allowed to assume its return-value object (pointed-to by a hidden pointer) is not the same object as anything else. i.e. that its output pointer (passed as a hidden first arg) doesn't alias anything.

You could think of this as the hidden first arg output pointer having an implicit restrict on it. (Because in the C abstract machine, the return value is a separate object, and the x86-64 System V ABI specifies that the caller provides space. x86-64 SysV doesn't give the caller license to introduce aliasing.)

Using an otherwise-private local as the destination (instead of separate dedicated space and then copying to a real local) is fine, but pointers that may point to something reachable another way must not be used. This requires escape analysis to make sure that a pointer to such a local hasn't been passed outside of the function.

I think the x86-64 SysV calling convention models the C abstract machine here by having the caller provide a real return-value object, not forcing the callee to invent that temporary if needed to make sure all the writes to the retval happened after any other writes. That's not what "the caller provides space for the return value" means, IMO.

That's definitely how GCC and other compilers interpret it in practice, which is a big part of what matters in a calling convention that's been around this long (since a year or two before the first AMD64 silicon, so very early 2000s).


Here's a case where your optimization would break if it were done:

struct Vec3{
    double x, y, z;
};
struct Vec3 glob3;

__attribute__((noinline))
struct Vec3 do_something(void) {  // copy glob3 to retval in some order
    return (struct Vec3){glob3.y, glob3.z, glob3.x};
}

__attribute__((noinline))
void use(struct Vec3 * out){   // copy do_something() result to *out
    *out = do_something();
}


void caller(void) {
    use(&glob3);
}

With the optimization you're suggesting, do_something's output object would be glob3. But it also reads glob3.

A valid implementation for do_something would be to copy elements from glob3 to (%rdi) in source order, which would do glob3.x = glob3.y before reading glob3.x as the 3rd element of the return value.

That is in fact exactly what gcc -O1 does (Godbolt compiler explorer)

do_something:
    movq    %rdi, %rax               # tmp90, .result_ptr
    movsd   glob3+8(%rip), %xmm0      # glob3.y, glob3.y
    movsd   %xmm0, (%rdi)             # glob3.y, <retval>.x
    movsd   glob3+16(%rip), %xmm0     # glob3.z, _2
    movsd   %xmm0, 8(%rdi)            # _2, <retval>.y
    movsd   glob3(%rip), %xmm0        # glob3.x, _3
    movsd   %xmm0, 16(%rdi)           # _3, <retval>.z
    ret     

Notice the glob3.y, <retval>.x store before the load of glob3.x.

So without restrict anywhere in the source, GCC already emits asm for do_something that assumes no aliasing between the retval and glob3.


I don't think using struct Vec3 *restrict out would help at all: that only tells the compiler that inside use() you won't access the *out object through any other name. Since use() doesn't reference glob3, it's not UB to pass &glob3 as an arg to a restrict version of use.

I may be wrong here; @M.M argues in comments that *restrict out might make this optimization safe because the execution of do_something() happens during use(). (Compilers still don't actually do it, but maybe they would be allowed to for restrict pointers.)

Update: Richard Biener said in the GCC missed-optimization bug-report that M.M is correct, and if the compiler can prove that the function returns normally (not exception or longjmp), the optimization is legal in theory (but still not something GCC is likely to look for):

If so, restrict would make this optimization safe if we can prove that do_something is "noexcept" and doesn't longjmp.

Yes.

There's a noexcept declaration, but there isn't (AFAIK) a nolongjmp declaration you can put on a prototype.

So that means it's only possible (even in theory) as an inter-procedural optimization when we can see the other function's body. Unless noexcept also means no longjmp.

Peter Cordes
  • If it were `void use(struct Vec3 * restrict out)`, then the code would have undefined behaviour, therefore this is not a counterexample for the claim that in the presence of `restrict` the optimization in question would be allowed – M.M Aug 07 '19 at 23:30
  • @M.M: I don't think that's true. `use()` doesn't directly read `glob3`; that's in a separate function outside the scope of the `*restrict out` variable. Can you explain in more detail why that's wrong, if you still think so? – Peter Cordes Aug 07 '19 at 23:34
  • Informally, `void use(struct Vec3 * restrict out)` means that it is UB if, *at any point during the execution of `use`*, the memory location designated by `*out` is accessed via `out` and also accessed in another way not via `out`. That does happen here, since `use` writes `*out` directly, and also `do_something()` (called within `use`) reads the same memory location via the name `glob3`. Formally, see C17 6.7.3.1 – M.M Aug 07 '19 at 23:43
  • @M.M: I think you're probably right. The wording in the standard talks about the lifetime of a block, and doesn't appear to rule out function calls that do execution outside the scope of the current block B but later return back into it. Anyway, the main point and example of my answer (about the no `restrict` case) doesn't depend on this argument at all. It definitely shows that GCC code-gen assumes that the return-value pointer doesn't alias anything else. – Peter Cordes Aug 08 '19 at 00:35
  • Unlike JohnBollinger, you argue that the hidden pointer must be restrict, but can otherwise point to any memory location (and not only memory in the stack frame of the caller or of the caller's caller). Thus, given that @M.M's interpretation of restrict is correct, it is a missed optimization (which every compiler out there makes)? – ead Aug 08 '19 at 07:05
  • I agree that x86-64 SysV doesn't give the caller license to introduce aliasing, but does it give the callee license to assume that the hidden pointer is restrict? Or is it a calling convention of the gcc compiler, because only one copy in the caller would be enough (no need to copy in the callee as well)? – ead Aug 08 '19 at 07:08
  • @ead: Yes, I can't see any reason why the callee could care about the return-value pointer being into the caller's stack-frame or not. The ABI doesn't give it any mechanism to even check for that. (Except maybe `.eh_frame` stack unwinding stuff.) But yes, without `restrict` and M.M's interpretation of it, the caller could only do this if it knew which globals and other objects a function reads so it could ensure it didn't create aliasing for the callee. (e.g. if the definition was visible but it still chose not to inline it). – Peter Cordes Aug 08 '19 at 07:10
  • @ead: I think the wording in the x86-64 SysV ABI is the key to treating the output pointer as if it were restrict, that the caller provides space. In the C abstract machine, the return-value object exists and is separate from the lvalue you assign the result to (if you assign it at all), as well as separate from any locals in the callee. So the calling convention has to make it possible to respect that. I think that fact should inform our interpretation of the calling-convention wording. Passing a pointer to another C object would be an *optimization*, not something normal. – Peter Cordes Aug 08 '19 at 07:14
  • @ead: So I think the x86-64 SysV calling convention models the C abstract machine here by having *the caller* provide a real return-value object, not forcing the *callee* to invent that temporary if needed to make sure all the writes to the retval happened after any other writes. That's not what "the caller provides space" means, IMO. – Peter Cordes Aug 08 '19 at 07:17
  • @ead: What gcc/clang/ICC do basically *is* how you need to interpret the x86-64 System V ABI. There are a few other relevant compilers, like SunCC if it still exists. The purpose of the document is to enable interoperability between their asm and existing compilers, and to document the rules those compilers follow on Linux, OS X, and other OSes using that ABI. So if there's ambiguity, that's normally considered a problem with the document, not with GCC. – Peter Cordes Aug 08 '19 at 07:24
  • @ead, Peter's argument is couched in quite different terms, but fundamentally, it is *not* different from mine. Peter has insightfully recognized the analogy with the semantics of C's `restrict`, whereas I go directly from ABI requirements to the implications for code generation, but either way, the point is that the non-aliasing requirements on the hidden pointer effectively require the caller to provide a pointer to space that belongs to it, thus ruling out the optimization you're looking for (in the original case). Stack allocation of that data or not is not a central issue. – John Bollinger Aug 08 '19 at 13:01
  • @M.M: It seems you were correct about the interpretation of `restrict`. Richard Biener (a gcc dev) thinks so, anyway, as long as GCC can prove the target function is `noexcept` and doesn't `longjmp`. So that means it's only possible as an inter-procedural optimization when we can see the other function's body; there's a `noexcept` declaration but there isn't (AFAIK) a `nolongjmp`. Updated my answer with a link. – Peter Cordes Aug 13 '19 at 07:22
  • @PeterCordes Good point about `longjmp` (and perhaps various non-standard constructs we can't think of right now?) – M.M Aug 13 '19 at 10:48
  • @M.M: yup, that was well spotted [by joseph@codesourcery.com](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398#c1) in the GCC bug report that \@ead opened for this. – Peter Cordes Aug 13 '19 at 10:50

Substantially rewritten:

I understand, that an alias of pointer out (e.g. as global variable) could be used in do_something and thus [out] cannot be passed as hidden pointer to do_something: if it would, out would be changed inside of do_something and not when do_something returns, thus some calculations might become faulty.

Except with respect to aliasing considerations inside do_something(), the difference in timing with respect to when *out is modified is irrelevant in the sense that use()'s caller cannot tell the difference. Such issues arise only with respect to accesses from other threads, and if that's a possibility then they arise anyway unless appropriate synchronization is applied.

No, the issue is primarily that the ABI defines how passing arguments to functions and receiving their return values works. It specifies that

If the type has class MEMORY, then the caller provides space for the return value and passes the address of this storage in %rdi

(emphasis added).

I grant that there's room for interpretation, but I take that as a stronger statement than just that the caller specifies where to store the return value. That it "provides" space means to me that the space in question belongs to the caller (which your *out does not). By analogy with argument passing, there's good reason to interpret that more specifically as saying that the caller provides space on the stack (and therefore in its own stack frame) for the return value, which in fact is exactly what you observe, though that detail doesn't really matter.

With that interpretation, the called function is free to assume that the return-value space is disjoint from any space it can access via any pointer other than one of its arguments. That this is supplemented by a more general requirement that the return space not be aliased (i.e. not through the function arguments either) does not contradict that interpretation. It may therefore perform operations that would be incorrect if in fact the space were aliased to something else accessible to the function.

The compiler is not at liberty to depart from the ABI specifications if the function call is to work correctly with a separately-compiled do_something() function. In particular, with separate compilation, the compiler cannot make decisions based on characteristics of the function's caller, such as aliasing information known there. If do_something() and use() were in the same translation unit, then the compiler might choose to inline do_something() into use(), or it might choose to perform the optimization you're looking for without inlining, but it cannot safely do so in the general case.

It was suggested to me, that using restrict should solve the problem,

restrict gives the compiler greater leeway to optimize, but that in itself does not give you any reason to expect specific optimizations that might then be possible. In fact, the language standard explicitly specifies that

A translator is free to ignore any or all aliasing implications of uses of restrict.

(C2011, 6.7.3.1/6)

restrict-qualifying out expresses that the compiler doesn't need to worry about it being aliased to any other pointer accessed within the scope of a call to use(), including during the execution of other functions it calls. In principle, then, I could see a compiler taking advantage of that to shortcut the ABI by offering somebody else's space for the return value instead of providing space itself, but just because it could do so does not mean that it will.

What prevents the usage of out as hidden pointer?

ABI compliance. The caller is expected to provide space that belongs to it, not to someone else, for storage of the return value. As a practical matter, however, I don't see anything in the restrict-qualified case that would invalidate shortcutting the ABI, so I take it that that's just not an optimization that has been implemented by the compiler in question.

NB: The desired (or very similar) behavior would be achieved for a slightly different function-signature: [...]

That case looks like a tail-call optimization to me. I don't see anything inherently inconsistent in the compiler performing that optimization, but not the one you're asking about, even though it is, to be sure, a different example of shortcutting the ABI.

John Bollinger
  • I don't understand, what in the ABI prevents `out` to be passed as hidden pointer - `do_something` doesn't care where `%rdi` points to - whether it is in the stack of the caller or somewhere else. – ead Aug 06 '19 at 14:15
  • But I think my misunderstanding is about what `restrict` does - it only means "no aliasing" in `use` and not in the whole code. Thanks. – ead Aug 06 '19 at 14:20
  • @ead, I have rewritten this answer with rather greater attention to some of the details of the ABI specification and of the C standard, coming to a somewhat softer conclusion. – John Bollinger Aug 06 '19 at 15:51
  • I disagree with "The difference in timing with respect to when *out is modified is irrelevant in the sense that use()'s caller cannot tell the difference..." I have tried to make it clear with an example in my question. – ead Aug 07 '19 at 18:28
  • I'm also not entirely convinced that "the caller provides space for the return value", means what you say - I cannot see, how callee can make any assumptions other that there will be memory pointed to by the hidden pointer. – ead Aug 07 '19 at 18:31
  • @ead, since you seemed originally to address timing effects and aliasing effects as separate considerations, I responded to them as such. The example now in your question suggests that you really mean these as one and the same thing, and that in fact highlights the key assumption that the ABI permits the callee to make: that the return-value storage does not overlap any data visible to it through any other identifier. The ABI requires the *caller* to ensure this, which it can do only by relying on its own caller (in the TCO case) or by providing storage that it itself owns. – John Bollinger Aug 07 '19 at 18:55
  • IMO it would be a legal optimization in the `restrict` case to just write to `out` but I expect just that no compiler developer has investigated doing it yet – M.M Aug 07 '19 at 23:18
  • *A translator is free to ignore any or all aliasing implications of uses of restrict.* I think your reasoning is backwards here. It's UB to pass aliasing pointers with `restrict` whether any specific compiler optimizes based on that or not. Plus, I think the real issue is (as I said in my answer) that the hidden pointer to the output object is implicitly `restrict`, and that applies to `do_something` not `use`. @M.M I don't think you're correct, unless I'm misunderstanding the scope of `restrict`. – Peter Cordes Aug 07 '19 at 23:30
  • @PeterCordes, it is not passing aliased pointers to `restrict`-qualified parameters that produces UB, but rather accessing restrict-pointed-to data via an alias that does. If such UB occurs, it occurs at execution time, not translation time, because it is data dependent. Since the OP is asking about why a particular implementation performs translation as it does with and without `restrict`, the Standard's explicit remark disclaiming any requirement for `restrict`-qualification to yield different translation seems entirely on point to me. – John Bollinger Aug 08 '19 at 11:55
  • Oh yes, I see what you're saying now. Just that even if this optimization is allowed with `restrict`, compilers might not look for it. – Peter Cordes Aug 08 '19 at 11:59
  • However, @PeterCordes, I agree that it is a reasonable characterization of the ABI specifications to say that the hidden pointer is effectively `restrict`-qualified, and that that applies to `do_something()`, not `use()`. I have approached that point from the direction of its implications for code generation, largely because I did not previously recognize that connection, but, retrospectively, because the semantics of `restrict` are a bit arcane for many people. – John Bollinger Aug 08 '19 at 12:02

The answers of @JohnBollinger and @PeterCordes cleared up a lot of things for me, but I decided to bug the gcc developers. Here is how I understand their answer.

As @PeterCordes has pointed out, the callee assumes that the hidden pointer is restrict. However, it also makes another (less obvious) assumption: the memory to which the hidden pointer points is uninitialized.

Why this is important is probably easier to see with the help of a C++ example:

struct Vec3 do_something(void){
   struct Vec3 res;
   res.x = 0.0; 
   res.y = func_which_throws(); 
   res.z = 0.0; 
   return res;
}

do_something writes directly to the memory pointed to by %rdi (as shown in the multiple listings in this Q&A), and it is allowed to do so only because this memory is uninitialized: if func_which_throws() throws and the exception is caught somewhere, then nobody will know that we have changed only the x-component of the result, because nobody knows which original value it had prior to being passed to do_something (nobody could have read the original value, because that would be UB).

The above would break if the out pointer were passed as the hidden pointer, because it could be observed that only a part of the memory, and not the whole, was changed in case of an exception being thrown and caught.

Now, C has something similar to C++'s exceptions: setjmp and longjmp. I had never heard of them before, but it looks like, in comparison to the C++ example, setjmp is best described as try ... catch ... and longjmp as throw.

This means that for C, too, we must ensure that the space provided by the caller is uninitialized.

Even without setjmp/longjmp there are some other issues, among others: interoperability with C++-code, which has exceptions, and -fexceptions option of gcc-compiler.


Corollary: The desired optimization would be possible if we had a qualifier for uninitialized memory (which we don't have), e.g. uninit; then

void use(struct Vec3 *restrict uninit out);

would do the trick.

ead
  • "uninitialized" is a sufficient requirement; but as I commented on the GCC bug I think this would be safe if the compiler could prove that `do_something` was `noexcept` and didn't call `longjmp`, if `restrict` works the way M.M argues it does. Or if `*out` was known to point to a local in a parent function that didn't catch exceptions, so if an exception did occur it would definitely unwind past the scope of the object pointed to by `out` (again making it impossible to see writes to `*out` done before the longjmp or exception, which don't happen in the C or C++ abstract machines.) – Peter Cordes Aug 10 '19 at 02:33
  • So inter-procedural analysis / optimization could make this safe in theory. Presumably still impractical for GCC to actually look for, though. – Peter Cordes Aug 10 '19 at 02:34