Below is a stripped-down example of a tagged union template "Storage", which can hold one of two types L and R in a union, plus a bool indicating which of them is stored. The instantiation uses two types of different sizes, the smaller one actually being empty.

#include <utility>

struct Empty
{
};

struct Big
{
        long a;
        long b;
        long c;
};

template<typename L, typename R>
class Storage final
{
public:
        constexpr explicit Storage(const R& right) : payload{right}, isLeft{false}
        {
        }

private:
        union Payload
        {
                constexpr Payload(const R& right) : right{right}
                {
                }
                L left;
                R right;
        };

        Payload payload;
        bool isLeft;
};

// Toggle constexpr here
constexpr static Storage<Big, Empty> createStorage()
{
        return Storage<Big, Empty>{Empty{}};
}

Storage<Big, Empty> createStorage2()
{
        return createStorage();
}
  • The constructor initializes the R member with Empty, and calls only the union's constructor for that member
  • The union is never default initialized as a whole
  • All constructors are constexpr

The function "createStorage2" should therefore only populate the bool tag and leave the union alone. So with default optimization "-O" I would expect this compile result:

createStorage2():
        mov     rax, rdi
        mov     BYTE PTR [rdi+24], 0
        ret

Both GCC and ICC instead generate something like

createStorage2():
        mov     rax, rdi
        mov     QWORD PTR [rdi], 0
        mov     QWORD PTR [rdi+8], 0
        mov     QWORD PTR [rdi+16], 0
        mov     QWORD PTR [rdi+24], 0
        ret

zeroing the entire 32-byte structure, while clang generates the expected code. You can reproduce this with https://godbolt.org/z/VsDQUu. GCC reverts to the desired initialization of the bool tag only when you remove constexpr from the "createStorage" static function, while ICC remains unimpressed and still fills all 32 bytes.
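To connect the offsets in the assembly to the layout of the type, here is a minimal sketch. "StorageLayout" is a hypothetical, all-public mirror of Storage<Big, Empty> (the real members are private), introduced only so that offsetof can be applied. Assuming x86-64 Linux, where long is 8 bytes, the payload occupies bytes 0 to 23, the tag sits at offset 24, and padding brings the total to 32 bytes, which matches the four QWORD stores.

```cpp
#include <cstddef>

struct Empty
{
};

struct Big
{
        long a;
        long b;
        long c;
};

// Hypothetical mirror of Storage<Big, Empty> with public members,
// used for layout inspection only (not part of the original code).
struct StorageLayout
{
        union Payload
        {
                Big left;
                Empty right;
        } payload;
        bool isLeft;
};

// Size of the union: as large as its biggest member.
constexpr std::size_t payloadSize()
{
        return sizeof(StorageLayout::Payload);
}

// Byte offset of the tag, i.e. the only byte that must be written.
constexpr std::size_t tagOffset()
{
        return offsetof(StorageLayout, isLeft);
}

// Total size including trailing padding after the bool.
constexpr std::size_t totalSize()
{
        return sizeof(StorageLayout);
}

// The tag directly follows the payload on any ABI.
static_assert(tagOffset() == sizeof(Big), "tag follows the union");
```

Only the single byte at tagOffset() carries information when the Empty alternative is stored; everything before it is the inactive Big member, and everything after it is padding.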

Doing so is probably not a standard violation, as the unused bits being "undefined" allows anything, including setting them to zero and consuming unnecessary CPU cycles. But it's annoying if you introduced the union for efficiency reasons in the first place and your union members vary widely in size.

What is going on here? Is there any way to work around this behavior, provided that removing constexpr from the constructors and the static function is not an option?

A side note: ICC seems to perform some extra operations even when all constexpr are removed, as in https://godbolt.org/z/FnjoPC:

createStorage2():
        mov       rax, rdi                                      #44.16
        mov       BYTE PTR [-16+rsp], 0                         #39.9
        movups    xmm0, XMMWORD PTR [-40+rsp]                   #44.16
        movups    xmm1, XMMWORD PTR [-24+rsp]                   #44.16
        movups    XMMWORD PTR [rdi], xmm0                       #44.16
        movups    XMMWORD PTR [16+rdi], xmm1                    #44.16
        ret                                                     #44.16

What is the purpose of these movups instructions?

1 Answer

(This is just speculation of mine, but it's too long for a comment)

What is going on here?

Since the constructors are constexpr, it could be that the Payload as a whole gets a value computed at compile time. Then, at runtime, that complete Payload is returned. To my knowledge, a compiler is not required to recognize that a certain portion of a compile-time value is uninitialized and to generate no code for it.

In some crazy compiler it could even happen that the compile-time Payload has garbage values in an uninitialized section, and then it would produce for example:

createStorage2():
        mov     rax, rdi
        mov     QWORD PTR [rdi], 0xbaadf00d
        mov     QWORD PTR [rdi+8], 0xbaadf00d
        mov     QWORD PTR [rdi+16], 0xbaadf00d
        mov     QWORD PTR [rdi+24], 0
        ret

In general, constexpr doesn't like uninitialized values, but unions offer a partial way around that.
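As a minimal sketch of that last point (my own illustration, not taken from the question's code): a union with constexpr constructors can be a constant expression even though only one member is ever initialized. The names "onlyRight", "withLeft" and "readA" are hypothetical and exist only for this example.

```cpp
struct Empty
{
};

struct Big
{
        long a;
        long b;
        long c;
};

union Payload
{
        // Each constructor initializes exactly one member; the other stays inactive.
        constexpr explicit Payload(const Empty& r) : right(r)
        {
        }
        constexpr explicit Payload(const Big& l) : left(l)
        {
        }
        Big left;
        Empty right;
};

// Accepted as a constant expression although "left" is never written.
constexpr Payload onlyRight{Empty{}};

// Here "left" is the active member, so it may be read at compile time.
constexpr Payload withLeft{Big{1, 2, 3}};

constexpr long readA()
{
        return withLeft.left.a;
}

static_assert(readA() == 1, "active member is readable in a constant expression");

// Reading onlyRight.left.a in a constant expression would be rejected,
// because "left" is not the active member there.
```

So the compiler has a compile-time value for "onlyRight" whose bytes belonging to "left" have no defined content. What it stores there when materializing the value at runtime is up to the implementation.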

CygnusX1
  • This is what I meant: There is certainly nothing in the C++ standard that forbids initializing undefined bits to whatever the compiler likes. It will still produce correct output, just using more CPU cycles than necessary. – Michael Steffens Mar 09 '20 at 11:00
  • OTOH, even if the C++ standard does not prescribe anything particular for any optimization, we all expect that optimizing compilers do a reasonable job eliminating at least the most local redundancies, when we argue that one shouldn't hand-optimize source code at the cost of readability and simplicity. Looks like a failure here. – Michael Steffens Mar 09 '20 at 11:53
  • I think it's not that easy to optimise this. The compiler has to explicitly track all uninitialized bytes of all constexpr values it holds during compilation, to recognize that it should not generate `mov addr, constant` statements for these. That is harder than just initializing them with garbage. – CygnusX1 Mar 09 '20 at 13:11
  • What does it need to explicitly track? A constexpr value can be computed at compile time, and the compiler needs to have a way to put a value in place whenever it's needed. For example using a sequence of statements (GCC, Clang). Copying a precomputed value from a source address would be an alternative method (ICC, as it seems?). In any case one separate sequence or source address is required for each individual value. In case of using a sequence, it doesn't even need to know the size of the target, just its location. Could you give an example why this would be oversimplified? – Michael Steffens Mar 10 '20 at 10:00
  • Your `Payload` is such an example. Its computed constexpr value is (in chunks of 64 bits): `[garbage, garbage, garbage, 0x0]`. A naive compiler then creates 64-bit `mov` instructions, some of which store those garbage values to memory. A more clever compiler needs to remember your `Payload` as `[uninitialized, uninitialized, uninitialized, 0x0]` and treat those values in a special way that skips the generation of `mov` instructions. This tracking of uninitialized versus garbage is what I meant. – CygnusX1 Mar 10 '20 at 11:19