
Why does gcc fill the whole array with zeros instead of only the remaining 96 integers? The non-zero initializers are all at the start of the array.

void *sink;
void bar() {
    int a[100]{1,2,3,4};
    sink = a;             // a escapes the function
    asm("":::"memory");   // and compiler memory barrier
    // forces the compiler to materialize a[] in memory instead of optimizing away
}

MinGW8.1 and gcc9.2 both make asm like this (Godbolt compiler explorer).

# gcc9.2 -O3 -m32 -mno-sse
bar():
    push    edi                       # save call-preserved EDI which rep stos uses
    xor     eax, eax                  # eax=0
    mov     ecx, 100                  # repeat-count = 100
    sub     esp, 400                  # reserve 400 bytes on the stack
    mov     edi, esp                  # dst for rep stos
    mov     DWORD PTR sink, esp       # sink = a
    rep stosd                         # memset(a, 0, 400) 

    mov     DWORD PTR [esp], 1        # then store the non-zero initializers
    mov     DWORD PTR [esp+4], 2      # over the zeroed part of the array
    mov     DWORD PTR [esp+8], 3
    mov     DWORD PTR [esp+12], 4
 # memory barrier empty asm statement is here.

    add     esp, 400                  # cleanup the stack
    pop     edi                       # and restore caller's EDI
    ret

(with SSE enabled it would copy all 4 initializers with movdqa load/store)

Why doesn't GCC do lea edi, [esp+16] and memset (with rep stosd) only the last 96 elements, like Clang does? Is this a missed optimization, or is it somehow more efficient to do it this way? (Clang actually calls memset instead of inlining rep stos)
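For reference, the two strategies being compared can be written out by hand in C++. This is only a rough sketch of what each approach amounts to, not what either compiler literally emits; the names init_zero_all, init_zero_tail and same_result are made up for illustration.

```cpp
#include <cassert>
#include <cstring>

constexpr int N = 100;

// GCC-style: zero all 100 elements, then overwrite the first 4.
// a[0..3] end up being written twice.
void init_zero_all(int *a) {
    std::memset(a, 0, N * sizeof(int));
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
}

// Clang-style: store the 4 initializers and zero only the remaining 96.
void init_zero_tail(int *a) {
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    std::memset(a + 4, 0, (N - 4) * sizeof(int));
}

// Both strategies must leave the array in the same state.
bool same_result() {
    int x[N], y[N];
    init_zero_all(x);
    init_zero_tail(y);
    assert(x[0] == 1 && x[N - 1] == 0);
    return std::memcmp(x, y, sizeof x) == 0;
}
```

Since the final contents are identical, the choice between them is purely a code-generation decision.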


Editor's note: the question originally had un-optimized compiler output which worked the same way, but inefficient code at -O0 doesn't prove anything. But it turns out that this optimization is missed by GCC even at -O3.

Passing a pointer to a[] to a non-inline function would be another way to force the compiler to materialize the array, but in 32-bit code that leads to significant clutter in the asm. (Stack args result in pushes, which get mixed in with the stores to the stack that init the array.)
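A minimal sketch of that alternative, assuming the GCC/Clang noinline attribute. consume and bar2 are hypothetical names; in a real experiment you would probably only declare the callee here and define it in another translation unit, since a visible body can still let inter-procedural optimizations change the picture.

```cpp
#include <cassert>

// Hypothetical helper. It is defined here only so the example is
// self-contained; a real test would hide the body from the optimizer.
[[gnu::noinline]] int consume(const int *p) {
    assert(p != nullptr);
    return p[0] + p[3] + p[99];
}

int bar2() {
    int a[100]{1, 2, 3, 4};
    return consume(a);   // the address of a escapes into the call
}
```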

Using volatile int a[100]{1,2,3,4} gets GCC to create the array and then copy it, which is insane. Normally volatile is good for looking at how compilers init local variables or lay them out on the stack.

Peter Cordes
Lassie
  • @Damien You misunderstood my question. I am asking why, for example, a[0] is assigned a value twice, as if `a[0] = 0;` and then `a[0] = 1;`. – Lassie Nov 24 '19 at 20:47
  • @Mat It optimized out everything but if I use the array, it's still the same. It fills the entire array with 0s first. – Lassie Nov 24 '19 at 20:54
  • I'm not able to read the assembly, but where does it show that the array is filled entirely with zeros? – smac89 Nov 24 '19 at 21:00
  • @smac89 the `movl $100, %ecx`. FYI, `clang` doesn't fill the whole array. – Jester Nov 24 '19 at 21:04
  • Also I usually see arrays being initialized using the syntax `int a[100] = {1,2,3,4};`, not the way you have it, which is usually reserved for object initialization – smac89 Nov 24 '19 at 21:04
  • Another interesting fact: for more items initialized, both gcc and clang revert to copying the whole array from `.rodata` ... I can't believe copying 400 bytes is faster than zeroing and setting 8 items. – Jester Nov 24 '19 at 21:11
  • You disabled optimization; inefficient code isn't surprising until you verify that the same thing happens at `-O3` (which it does). https://godbolt.org/z/rh_TNF – Peter Cordes Nov 24 '19 at 22:08
  • @smac89: I updated the question with commented asm. (Using Intel syntax instead of AT&T; if the OP wants AT&T syntax Godbolt has an AT&T option.) I might have put too much "answer" into the question, e.g. the fact that clang does this optimization shows that either way is valid and it's just a choice / missed optimization, not a C++ requirement. At this point this should be a GCC missed-optimization bug report as well / instead of an SO question. https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc – Peter Cordes Nov 24 '19 at 22:46
  • What more do you want to know? It's a missed optimization, go report it on GCC's bugzilla with the `missed-optimization` keyword. – Peter Cordes Nov 27 '19 at 23:12
  • Does it depend on the array size - could you try array size other than `100` such as `64` on your system to check for difference in code generated - might depend on cache line size? – srinivirt Jan 03 '20 at 17:21
  • @Lassie could you try different array sizes and provide an update on the code generated, whether there is a cutoff size beyond which you see zeroing of entire array or not? – srinivirt Jan 04 '20 at 16:45
  • The Clang code does indeed run 63% faster on their machines, based on a quick thousand loop timing – HackerBoss Jan 07 '20 at 18:18

1 Answer


In theory, your initialization could look like this:

int a[100] = {
  [3] = 1,
  [5] = 42,
  [88] = 1,
};

so it may be more efficient, in terms of caching and optimizability, to first zero out the whole memory block and then set the individual values.

The behavior may change depending on:

  • target architecture
  • target OS
  • array length
  • initialization ratio (explicitly initialized values/length)
  • positions of the initialized values

Of course, in your case the initializers are packed at the start of the array, and the optimization would be trivial.

So it seems that gcc takes the most generic approach here. It looks like a missed optimization.
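To make the generic approach concrete, here is a hand-written C++ sketch of what zero-then-store amounts to for the sparse example above. Array designators such as `[3] = 1` are C syntax, so the sketch uses plain assignments instead; init_sparse and count_nonzero are hypothetical names.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: zero the whole block, then patch the few non-zero slots.
void init_sparse(int (&a)[100]) {
    std::memset(a, 0, sizeof a);  // one bulk pass over the whole array
    a[3] = 1;
    a[5] = 42;
    a[88] = 1;
}

// Counts the non-zero elements left behind by init_sparse.
int count_nonzero() {
    int a[100];
    init_sparse(a);
    assert(a[5] == 42);
    int n = 0;
    for (int v : a) n += (v != 0);
    return n;
}
```

With scattered positions like these, a single bulk zeroing pass followed by a few stores is simpler than zeroing each gap separately, which may be why a compiler would default to it.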

vlad_tepesch
  • 6,681
  • 1
  • 38
  • 80
  • Yes, an optimal strategy for *this* code probably would be to zero everything, or maybe just everything starting from `a[6]` onward with the early gaps filled with single stores of immediates or zeros. Especially if targeting x86-64 so you can use qword stores to do 2 elements at once, with the lower one non-zero. e.g. `mov QWORD PTR [rsp+3*4], 1` to do elements 3 and 4 with one misaligned qword store. – Peter Cordes Jan 22 '20 at 09:37
  • Behaviour could in theory depend on target OS, but in actual GCC it won't, and has no reason to. Only target architecture (and within that, tuning options for different microarchitectures, like `-march=skylake` vs. `-march=k8` vs. `-march=knl` would all be very different in general, and maybe in terms of appropriate strategy for this.) – Peter Cordes Jan 22 '20 at 12:11
  • Is this even allowed in C++? I thought it is only C. – Lassie Jan 22 '20 at 22:10
  • @Lassie You are right, in C++ this is not allowed, but the question is more about the compiler backend, so it does not matter that much. Also, the code shown could be either. – vlad_tepesch Jan 23 '20 at 13:06
  • You could even easily construct examples that work the same way in C++ by declaring some `struct Bar{ int i; int a[100]; int j; };` and initializing `Bar a{1,{2,3,4},4};`. gcc does the same thing: it zeros everything out and then sets the 5 values. – vlad_tepesch Jan 23 '20 at 13:13