How are oversized struct returned on the stack?

Question

It is said that returning an oversized struct by value (as opposed to returning a pointer to the struct) from a function incurs an unnecessary copy on the stack. By "oversized", I mean a struct that cannot fit in the return registers.

However, to quote Wikipedia

When an oversized struct return is needed, another pointer to a caller-provided space is prepended as the first argument, shifting all other arguments to the right by one place.

and

When returning struct/class, the calling code allocates space and passes a pointer to this space via a hidden parameter on the stack. The called function writes the return value to this address.

It appears that, at least on x86 architectures, the struct in question is written directly by the callee to the memory designated by the caller, so why would there be a copy? Does returning oversized structs really incur a copy on the stack?

The wiki is correct here. But in practice there is frequently still a copy operation (sometimes more than one): the function fills a local struct variable first, and only on return is this local struct copied to the caller-provided space. Without optimization there will be 1 or 2 copies. If the optimizer is good enough there may be no copy, but only maybe. — RbMm, Jun 03 '21 at 16:05
@RbMm I see, so does passing-by-reference help saving some copy compared to passing-by-value? If the copies happen *within* the callee, then I don't think returning a pointer to the `struct` would help. — nalzok, Jun 03 '21 at 16:11
Real *passing-by-value* is impossible; it never happens for an *oversized* struct. The caller always allocates the structure and passes a pointer to it as a hidden argument. — RbMm, Jun 03 '21 at 16:16
@RbMm: For pass-by-value, it's not impossible, Windows calling conventions just choose not to work that way. i386 System V and x86-64 System V pass struct args actually on the stack (if they're too large to fit in a pair of registers for x86-64). https://godbolt.org/z/ThMrE9rqT shows x86-64 GCC targeting Linux vs. x64 MSVC targeting Windows. However, even in the Windows calling convention, the callee "owns" the arg and can modify it, so a tmp copy is still needed, *as well as* passing a pointer to that stack memory. — Peter Cordes, Jun 04 '21 at 03:53

score 7 · Answer 1 · answered Jun 04 '21 at 13:33

If the function inlines, the copying through the return-value object can be fully optimized away. Otherwise, maybe not, and arg copying definitely can't be.

It appears that, at least on x86 architectures, the struct in question is written directly by the callee to the memory designated by the caller, so why would there be a copy? Does returning oversized structs really incur a copy on the stack?

It depends on what the caller does with the return value; if it's assigned to a provably private object (escape analysis), that object can be the return-value object, passed as the hidden pointer.
But if the caller actually wants to assign the return value to other memory, then it does need a temporary.

struct large retval = some_func();   // no extra copying at all

*p = some_func();      // caller will make space for a local return-value object & copy.

(Unless the compiler knows that p is just pointing to a local struct large tmp;, and escape analysis can prove that there's no way some global variable could have a pointer to that same tmp var.)

long version, same thing with more details:

In the C abstract machine, there's a "return value object", and return foo copies the named variable foo to that object, even if it's a large struct. Or return (struct lg){1,2}; copies an anonymous struct. The return-value object itself is anonymous; nothing can take its address. (You can't int *p = &foo(123);). This makes it easier to optimize away.

In the caller, that anonymous return-value object can be assigned to whatever you want, which would be another copy if compilers didn't optimize anything. (All of this applies for any type, even int). Of course, compilers that aren't total garbage will avoid some, ideally all, of that copying, when doing so can't possibly change the observable results. And that depends on the design of the calling convention. As you say, most conventions, including all the mainstream x86 and x86-64 conventions, pass a "hidden pointer" arg for return values they choose not to return in register(s) for whatever reason (size, C++ having a non-trivial constructor).

struct large retval = foo(...);

For such calling conventions, the above code is effectively transformed to

struct large retval;
foo(&retval, ...);

So its C return-value object actually is a local in the stack frame of its caller. foo() is allowed to store into that return-value object whenever it wants during execution, including before reading some other objects. This allows optimization within the callee (foo) as well, so a struct large tmp = ... / return tmp can be optimized away to just store into the return-value object.

So there's zero extra copying when the caller does just want to assign the function return value to a newly declared local var. (Or to a local var which it can prove is still private, via escape analysis. i.e. not pointed-to by any global vars).
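As a rough source-level picture of that transformation, the sketch below writes the hidden pointer out by hand. The names `make_by_value`, `make_by_pointer`, and `same_result`, and the 64-byte layout of `struct large`, are made up for illustration; a real compiler does this at the calling-convention level, not in C source.

```c
#include <string.h>

/* Hypothetical 64-byte struct: too big for the return registers on x86. */
struct large { int v[16]; };

/* What you write: return by value. */
struct large make_by_value(int seed) {
    struct large tmp;
    for (int i = 0; i < 16; i++)
        tmp.v[i] = seed + i;
    return tmp;
}

/* Roughly what the calling convention turns it into: the caller supplies
   the storage and the callee fills it in through a pointer. */
void make_by_pointer(struct large *out, int seed) {
    for (int i = 0; i < 16; i++)
        out->v[i] = seed + i;
}

/* Returns 1 if both forms produce the same bytes for a given seed. */
int same_result(int seed) {
    struct large a = make_by_value(seed);
    struct large b;
    make_by_pointer(&b, seed);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Both forms leave the caller with the same object contents; the by-value form just keeps the pointer out of the source.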

But what if the caller wants to store the return value somewhere else?

void caller2(struct large *lgp) {
    *lgp = foo();
}

Can *lgp be the return-value object, or do we need to introduce a local temporary?

void caller2(struct large *lgp) {
    // foo_asm(lgp);                        // nope, possibly unsafe
    struct large retval;  foo(&retval);  *lgp = retval;    // safe
}

If you want functions to be able to write large structs to arbitrary locations, you have to "sign off" on it by making that effect visible in your source.
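To see why the temporary is needed, here is a small sketch in which the callee can read the assignment target through a global pointer. The struct layout and the names `global_view`, `make`, and `assign_over_alias` are invented for this example. In the C abstract machine the assignment happens only after the call returns, so the callee must still see the target's old contents.

```c
/* Hypothetical struct and names, for illustration only. */
struct large { int first, second; int pad[14]; };

static struct large *global_view;   /* aliases the assignment target below */

/* Builds its result while reading through a global pointer. */
static struct large make(void) {
    struct large tmp = {0};
    tmp.first = 1;
    tmp.second = global_view->first;  /* must see the target's OLD value */
    return tmp;
}

/* Assigns make()'s result over an object that make() can also read. */
struct large assign_over_alias(void) {
    struct large target = { .first = 42 };
    global_view = &target;
    target = make();       /* the assignment happens after make() returns */
    return target;
}
```

The result must have `first == 1` and `second == 42`. If `&target` were passed as the hidden pointer and the callee stored `first` before loading through `global_view`, `second` would come out as 1 instead, which is why the compiler has to use a separate return-value object here.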

What prevents the usage of a function argument as hidden pointer? - more details about why *lgp can't be the return-value object / hidden pointer, and another example. "A function is allowed to assume its return-value object (pointed-to by a hidden pointer) is not the same object as anything else". Also details of whether struct large *restrict lgp would make it safe: probably yes if the function doesn't longjmp (otherwise stores to the supposedly anonymous retval object might end up as visible side effects without return having been reached), but GCC doesn't look for that optimization.
Why is tailcall optimization not performed for types of class MEMORY? - return bar() where bar returns the same struct should be possible as an optimized tailcall, but compilers don't do it. That missed optimization can even introduce extra copying of the whole struct, as well as failing to optimize call bar / ret into jmp bar.
how c compiler treats a struct return value from a function, in ASM - thresholds for returning in registers. e.g. i386 System V always returns structs in memory, even struct {int x;};.
Is it possible within a function to get the memory address of the variable initialized by the return value?
C/C++ returning struct by value under the hood - an actual example (but unfortunately using debug-mode compiler-generated asm, so it contains copying that isn't necessary).
How do objects work in x86 at the assembly level? - example at the bottom of how x86-64 System V packs the bytes of a struct into RDX:RAX, or just RAX if it's 8 bytes or smaller.
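The register-return thresholds mentioned in the list above can be written down as a tiny helper. This is a simplified sketch, assuming a struct made only of naturally aligned integer members; `enum abi` and `returned_in_memory` are hypothetical names, and the real classification rules (floating-point members, C++ non-trivial types, and so on) are more involved.

```c
#include <stddef.h>

/* Hypothetical helper, not part of any real toolchain. */
enum abi { ABI_I386_SYSV, ABI_X86_64_SYSV };

/* Returns 1 if a struct of `size` bytes made only of integer members
   would be returned through the hidden pointer, 0 if in registers. */
int returned_in_memory(enum abi abi, size_t size) {
    switch (abi) {
    case ABI_I386_SYSV:
        return 1;              /* always memory, even struct {int x;} */
    case ABI_X86_64_SYSV:
        return size > 16;      /* up to 16 bytes fits in RDX:RAX */
    }
    return 1;                  /* unknown ABI: assume memory */
}
```

For example, a 24-byte struct goes through the hidden pointer on both ABIs, while an 8-byte struct does so only on i386 System V.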

An example showing early stores to the return-value object (instead of copying)

(all source + asm on the Godbolt compiler explorer)

// more or less extra size will get compilers to copy it around with SSE2 or not
struct large { int first, second; char pad[0];};

int *global_ptr;
extern int a;
NOINLINE                 // __attribute__((noinline))
struct large foo() {
    struct large tmp = {1,2};
    if (a)
        tmp.second = *global_ptr;
    return tmp;
}

(targeting GNU/Linux) clang -m32 -O3 -mregparm=1 creates an implementation that writes its return-value object before it's done reading everything else, exactly the case that would make it unsafe for the caller to pass a pointer to some globally-reachable memory.

The asm makes it clear that tmp is fully optimized away, or is the retval object.

# clang -O3 -m32 -mregparm=1
foo:
        mov     dword ptr [eax + 4], 2
        mov     dword ptr [eax], 1         # store tmp into the retval object
        cmp     dword ptr [a], 0
        je      .LBB0_2                   # if (a == 0) goto ret
        mov     ecx, dword ptr [global_ptr]      # load the global
        mov     ecx, dword ptr [ecx]             # deref it
        mov     dword ptr [eax + 4], ecx         # and store to the retval object
.LBB0_2:
        ret

(-mregparm=1 means pass the first arg in EAX, which is less noisy and easier to distinguish visually from stack space than passing on the stack. Fun fact: i386 Linux compiles the kernel with -mregparm=3. Fun fact #2: if a hidden pointer is passed on the stack (i.e. no regparm), that arg is callee-popped, unlike the rest. The function will use ret 4 to do ESP+=4 after popping the return address into EIP.)

In a simple caller, the compiler just reserves some stack space, passes a pointer to it, and then can load member variables from that space.

int caller() {
    struct large lg = {4, 5};   // initializer is dead, foo can't read its retval object
    lg = foo();
    return lg.second;
}

caller:
        sub     esp, 12
        mov     eax, esp
        call    foo
        mov     eax, dword ptr [esp + 4]
        add     esp, 12
        ret

But with a less trivial caller:

int caller() {
    struct large lg = {4, 5};
    global_ptr = &lg.first;
    // unknown(&lg);       // or this: as a side effect, might set global_ptr = &tmp->first;
    lg = foo();          // (except by inlining) the compiler can't know if foo() looks at global_ptr
    return lg.second;
}

caller:
        sub     esp, 28                   # reserve space for 2 structs, and alignment
        mov     dword ptr [esp + 12], 5
        mov     dword ptr [esp + 8], 4        # materialize lg
        lea     eax, [esp + 8]
        mov     dword ptr [global_ptr], eax   # point global_ptr at it
        lea     eax, [esp + 16]               # hidden first arg *not* pointing to lg
        call    foo
        mov     eax, dword ptr [esp + 20]     # reload from the retval object
        add     esp, 28
        ret

Extra copying with `*lgp = foo();`

int caller2(struct large *lgp) {
    global_ptr = &lgp->first;
    *lgp = foo();
    return lgp->second;
}

# with GCC11.1 this time, SSE2 8-byte copying unlike clang
caller2:      # incoming arg: struct large *lgp in EAX
        push    ebx     #
        mov     ebx, eax  # lgp, tmp89      # lgp needed after foo returns
        sub     esp, 24     # reserve space for a retval object (and waste 16 bytes)
        mov     DWORD PTR global_ptr, eax # global_ptr, lgp
        lea     eax, [esp+8]                # hidden pointer to the retval object
        call    foo     #
        movq    xmm0, QWORD PTR [esp+8]    # 8-byte copy of both halves
        movq    QWORD PTR [ebx], xmm0   # *lgp_2(D), tmp86
        mov     eax, DWORD PTR [ebx+4]    # lgp_2(D)->second, lgp_2(D)->second  # reload int return value
        add     esp, 24
        pop     ebx
        ret

The copy to *lgp needs to happen, but it's somewhat of a missed optimization to reload from there, instead of from [esp+12]. (Saves a byte of code size at the cost of more latency.)

Clang does the copy with two 4-byte integer register mov loads/stores, but one of them is into EAX so it already has the return value ready.

You might also want to look at the result of assigning to memory freshly allocated with malloc. Compilers know that nothing else can (legally) be pointing to the newly allocated memory: that would be use-after-free undefined behaviour. So they may allow passing on a pointer from malloc as the return-value object if it hasn't been passed to anything else yet.
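A minimal sketch of that malloc case follows. The struct layout, the stand-in body of `foo`, and the name `make_on_heap` are assumptions for illustration; whether a particular compiler really passes `p` as the hidden pointer is something to check in its asm output.

```c
#include <stdlib.h>

struct large { int first, second; int pad[14]; };  /* hypothetical layout */

/* Stand-in for the answer's foo(); the values are arbitrary. */
struct large foo(void) {
    struct large tmp = { .first = 1, .second = 2 };
    return tmp;
}

/* Returns a heap-allocated copy of foo()'s result, or NULL on failure.
   The caller must free() it. */
struct large *make_on_heap(void) {
    struct large *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    *p = foo();   /* p has not escaped yet, so nothing else can alias *p */
    return p;
}
```

Comparing the generated code for this against `caller2` above should show whether the temporary and the copy are still there.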

Related fun fact: passing large structs by value always requires a copy (if the function doesn't inline). But as discussed in comments, the details depend on the calling convention. Windows differs from i386 / x86-64 System V calling conventions (all non-Windows OSes) on this:

SysV calling conventions copy the whole struct to the stack (if it's too large to fit in a pair of registers, for x86-64).
Windows x64 makes a copy and passes (like a normal arg) a pointer to that copy. The callee "owns" the arg and can modify it, so a tmp copy is still needed. (And no, const struct large foo has no effect.)

https://godbolt.org/z/ThMrE9rqT shows x86-64 GCC targeting Linux vs. x64 MSVC targeting Windows.
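The reason a copy is needed for by-value args shows up at the source level: the callee may modify its parameter, and the caller's object must not change. A small sketch, with a made-up struct layout and the hypothetical names `bump_and_read` and `caller_value_after_call`:

```c
struct large { int first, second; int pad[14]; };  /* hypothetical layout */

/* The callee owns its by-value parameter and may freely modify it. */
int bump_and_read(struct large arg) {
    arg.first += 100;     /* changes only the callee's copy */
    return arg.first;
}

/* Returns the caller's `first` after the call; it must be unchanged. */
int caller_value_after_call(void) {
    struct large mine = { .first = 1, .second = 2 };
    int seen_by_callee = bump_and_read(mine);
    (void)seen_by_callee;          /* 101 inside the callee */
    return mine.first;             /* still 1 */
}
```

Whichever convention is in use, either copying onto the stack or passing a pointer to a temporary, the caller's `mine` has to survive the call untouched.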

score 2 · Answer 2 · Guy Marino · 2021-06-03T16:11:50.917

This really depends on your compiler, but in general the way this works is that the caller allocates the memory for the struct return value, but the callee also allocates stack space for any intermediate value of that structure. This intermediate allocation is used when the function is running, and then the struct is copied onto the caller's memory when the function returns.

For reference as to why your solution won't always work, consider a program which has two of the same struct and returns one based on some condition:

large_t returntype(int condition) {
  large_t var1 = {5};
  large_t var2 = {6};

  // More intermediate code here

  if(condition) return var1;
  else return var2;
}

In this case, both may be required by the intermediate code, but the return value is not known at compile time, so the compiler doesn't know which to initialize on the caller's stack space. It's easier to just keep it local and copy on return.

EDIT: Your solution may be the case in simple functions, but it really depends on the optimizations performed by each individual compiler. If you're really interested in this, check out https://godbolt.org/
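For comparison, here is a sketch of the same function restructured around a single result variable, which gives a compiler the easiest case for building the value directly in the caller-provided space. `large_t` is not defined in the answer, so the layout below is a guess, and `two_locals` / `one_local` are hypothetical names; whether the copy actually disappears depends on the compiler and flags.

```c
typedef struct { int x; int pad[15]; } large_t;   /* hypothetical layout */

/* The answer's shape: two candidates, one picked at return time. */
large_t two_locals(int condition) {
    large_t var1 = {5};
    large_t var2 = {6};
    if (condition) return var1;
    else return var2;
}

/* Single named result on every path. */
large_t one_local(int condition) {
    large_t r = {0};
    r.x = condition ? 5 : 6;
    return r;
}
```

Both functions return the same values; comparing their optimized asm on https://godbolt.org/ is the way to see whether a given compiler materializes `var1` and `var2` on the stack.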

edited Jun 03 '21 at 16:11

answered Jun 03 '21 at 16:05

Guy Marino

This isn't really a difference from pass-by-pointer, though, is it? If `returntype` took a `large_t *p` argument, converting your code naively would result in `*p = condition ? var1 : var2;`, which also involves a copy. If you want to avoid the copy you have to rewrite as `p->x = condition ? 5 : 6;`, but if you can do that, then you can rewrite your version as `large_t r; r.x = condition ? 5 : 6; return r;` and now you are back to something where the compiler can optimize out the copy. – Nate Eldredge Jun 03 '21 at 16:11
This is not intended to be a specific example, rather just a general-purpose response. It is far easier to copy a contiguous struct's memory than to change multiple values individually on most CPUs, thanks to special memory copy instructions. If `large_t` had a lot of fields, then your solution would actually take more time than copying the whole struct. – Guy Marino Jun 03 '21 at 16:15
The callee *is* allowed to write to its return-value object early. My answer contains an example of clang asm output which actually does that, for similar source, eliminating any extra copying inside this function. (If you're unlucky, though, a compiler might actually materialize `var1` and `var2` on the stack, especially if you take the address of either one somewhere before the condition.) (ping @NateEldredge) – Peter Cordes Jun 04 '21 at 13:37
