Array initialization optimization

Question

When compiling the following code snippet (clang x86-64 -O3)

std::array<int, 5> test()
{
    std::array<int, 5> values {{0, 1, 2, 3, 4}};
    return values;
}

It produced the typical assembly that I would expect

test():                               # @test()
        mov     rax, rdi
        mov     ecx, dword ptr [rip + .L__const.test().values+16]
        mov     dword ptr [rdi + 16], ecx
        movups  xmm0, xmmword ptr [rip + .L__const.test().values]
        movups  xmmword ptr [rdi], xmm0
        ret
.L__const.test().values:
        .long   0                       # 0x0
        .long   1                       # 0x1
        .long   2                       # 0x2
        .long   3                       # 0x3
        .long   4                       # 0x4

However for small arrays, it seems to have figured out a trick?

std::array<int, 3> test()
{
    std::array<int, 3> values {{0, 1, 2}};
    return values;
}

This was the corresponding assembly

test():                               # @test()
        movabs  rax, 4294967296
        mov     edx, 2
        ret

Where did that magic number (4294967296) come from? Is that essentially a value that can be reinterpret_cast back into an array of int somehow?

On Godbolt you can mouseover a number to see it in Hex. If you look at asm often, you're going to get used to `4294967...` being about 2^32 and thus a huge clue that you should look at the hex to see the upper/lower 32 bits. (Or for numbers just *below* 2^32, that it's actually a negative 32-bit integer.) — Peter Cordes, Jun 20 '19 at 06:26

Martin Bonner supports Monica · Accepted Answer · 2019-06-20T05:32:40.943

40

A std::array<int, 3> is 96 bits wide on your implementation. As such the ABI declares that it should be returned in RAX + the low 32 bits of RDX (aka EDX).

4294967296 is 2³², in hex it is $1'0000'0000. So the movabs stores 0 in the low order 32 bits of RAX, and 1 in the high order bits of RAX. The mov stores 2 in EDX (which is exactly what you wanted).

edited Jun 20 '19 at 05:32

answered Jun 19 '19 at 13:05

Martin Bonner supports Monica

28,528
3
51
88

D'oh! Thanks sasha for proposing the edit, and caf for doing it. – Martin Bonner supports Monica Jun 20 '19 at 05:33
3

Fun fact: with BMI2 a probably more optimal way to compile this would be `mov edx, 2` and then [`rorx rax, rdx, 33`](https://www.felixcloutier.com/x86/RORX.html) to create `1<<32` from the `1<<1`. `movabs` is so big (10 bytes) that it's slow to decode and can take extra space in the uop cache. `rorx` is 6 bytes (3-byte VEX + opcode + modrm + imm8) but its immediate is small. It does have a dependency on the `mov`-immediate and can only run on the shift ports on Intel CPUs, but normally that's not a bottleneck (especially in a code path that leads to a `ret`, not contained in a tight loop) – Peter Cordes Jun 20 '19 at 06:33
@PeterCordes I think the optimizer can be forgiven for not spotting this particular possibility - it's *very* specific to the particular data pattern! – Martin Bonner supports Monica Jun 20 '19 at 10:24
clang does look for things like `lea r64, [reg + disp8]` to create a 2nd large constant instead of 2x `movabs`. e.g. see [Signed saturated add of 64-bit ints?](//stackoverflow.com/a/56531252) where clang uses `movabs rcx, 9223372036854775807` ; `lea rax, [rcx + 1]` to materialize `INT64_MAX` and `INT64_MIN` in RCX and RAX respectively, in versions of the function that need both. https://godbolt.org/z/iD6Ml8 GCC just uses 2x `movabs` even with `-Os` – Peter Cordes Jun 20 '19 at 21:27
GCC and clang both look for optimizations like that in some cases: https://godbolt.org/z/wctons like using a `sub rax, 120` to create a 2nd nearby constant instead of using another `movabs`. They're not very good at it, though, and yeah probably aren't looking for the possibility of rotating :P – Peter Cordes Jun 20 '19 at 21:32

Array initialization optimization

1 Answers1