
I know that "why is my compiler doing this" questions aren't the best kind, but this one is really bizarre to me and I'm thoroughly confused.

I had thought that std::min() was the same as a handwritten ternary (with maybe some compile-time template machinery), and it does seem to compile down to the same operation when used normally. However, when I try to get a "min and sum" loop to autovectorize, the two are not treated the same, and I would love it if someone could help me figure out why. Here is a small example that reproduces the issue:

#pragma GCC target ("avx2")
#pragma GCC optimize ("O3")

#include <cstdio>
#include <cstdlib>
#include <algorithm>

#define N (1<<20)
char a[N], b[N];

int main() {
    for (int i=0; i<N; ++i) {
        a[i] = rand()%100;
        b[i] = rand()%100;
    }

    int ans = 0;
    #pragma GCC ivdep
    for (int i=0; i<N; ++i) {
        //ans += std::min(a[i], b[i]);
        ans += a[i]>b[i] ? a[i] : b[i];
    }
    printf("%d\n", ans);
}

I compile this with gcc 9.3.0, using the command g++ -o test test.cpp -ftree-vectorize -fopt-info-vec-missed -fopt-info-vec-optimized -funsafe-math-optimizations.

With the code above as is, the compiler reports:

test.cpp:19:17: optimized: loop vectorized using 32 byte vectors

In contrast, if I comment out the ternary and uncomment the std::min line, I get this:

test.cpp:19:17: missed: couldn't vectorize loop
test.cpp:20:35: missed: statement clobbers memory: _9 = std::min<char> (_8, _7);

So std::min() seems to be doing something unusual that prevents gcc from recognizing that it is just a min operation. Is this behavior required by the standard? Is it an implementation shortcoming? Or is there some compiler flag that would make this work?

Maltysen
  • It appears that adding an optimization level of `-O1` or higher [achieves the vectorization](https://godbolt.org/z/EYeTe4EfW). I suspect it's a matter of function inlining. – Drew Dormann May 21 '21 at 16:21
  • @MarekR this isn't premature optimization, on my computer it sped up by >2x. I think your compiler explorer link doesn't have the vectorization flags, because the asm output has no vectorization instructions (e.g. `vpaddd`, etc.) – Maltysen May 21 '21 at 16:24
  • Yeah, none of those optimization options actually does anything unless you use `-O`. In particular, without `-O` the compiler won't inline `std::min`, so for all it knows it might modify the globals `a,b`. – Nate Eldredge May 21 '21 at 16:24
  • https://godbolt.org/z/dnGb61PYT – Marek R May 21 '21 at 16:25
  • I have the `optimize("O3")` pragma at the top of my code? Do I just not understand how that works? – Maltysen May 21 '21 at 16:26
  • And I have it in the compiler options. – Marek R May 21 '21 at 16:27
  • Hmm, that's a good question. At a quick reading, `#pragma GCC optimize ("O3")` is equivalent to specifying `__attribute__((optimize("O3")))` on every function. But I might guess that this does not allow for interprocedural optimizations like inlining, and so it may not be equivalent to using `-O3` on the command line. Anyway, this shows that in practice they are *not* equivalent. – Nate Eldredge May 21 '21 at 16:29
  • @NateEldredge Ah wow, that's embarrassing, so that pragma is not the same as normal `-O3`. The more you know, I guess. If you make that an answer I'll accept it. Thanks for your help. – Maltysen May 21 '21 at 16:43
  • Consider using portable OpenMP SIMD directives rather than the compiler-specific ivdep, especially if your goal is vectorization (e.g. SIMD functions and `#pragma omp simd`; a sketch follows below). Note that GCC is very sensitive to the operator in the ternary (`(a>b)?c:d` can surprisingly give different results than `(b>a)?d:c`) and has some difficulty generating branchless code... Note also that aliasing often impacts GCC's ability to generate vectorized assembly. – Jérôme Richard May 21 '21 at 17:39
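
A hedged sketch of the portable OpenMP form mentioned in the last comment (my own illustration, not code from the thread): `#pragma omp simd` with a reduction clause replaces the GCC-specific ivdep pragma, and the directive is honored when compiling with e.g. g++ -O3 -fopenmp-simd.

#include <algorithm>

// Sketch only: portable OpenMP SIMD version of the sum-of-min loop.
// The reduction clause marks ans as a SIMD reduction so each vector lane
// can accumulate privately before the partial sums are combined.
int sum_min(const char* a, const char* b, int n) {
    int ans = 0;
    #pragma omp simd reduction(+:ans)
    for (int i = 0; i < n; ++i)
        ans += std::min(a[i], b[i]);
    return ans;
}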

1 Answer


Summary: don't use #pragma GCC optimize. Use -O3 on the command line instead, and you'll get the behavior you expect.

GCC's documentation on #pragma GCC optimize says:

Each function that is defined after this point is treated as if it had been declared with one optimize(string) attribute for each string argument.

And the optimize attribute is documented as:

The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. [...] *The optimize attribute should be used for debugging purposes only. It is not suitable in production code.* [Emphasis added; thanks to Peter Cordes for spotting the last part.]

So, don't use it.

In particular, it looks like specifying #pragma GCC optimize ("O3") at the top of your file is not actually equivalent to using -O3 on the command line. It turns out that the former doesn't result in std::min being inlined, and so the compiler actually does assume that it might modify global memory, such as your a,b arrays. This naturally inhibits vectorization.
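
To see why the missing inlining matters, here is a rough illustration (my own sketch, not code from the question or the GCC manual): when the callee is opaque to the optimizer, GCC must assume every call might write to the globals, which is exactly the "statement clobbers memory" complaint in the diagnostics.

// Hypothetical example: g() is only declared here, so the optimizer cannot
// prove it leaves the global array untouched. Every call is treated as
// potentially clobbering memory, forcing data[i] to be reloaded on each
// iteration and preventing the loop from being vectorized.
extern int g(int x);

int data[1 << 10];

int sum_calls() {
    int ans = 0;
    for (int i = 0; i < (1 << 10); ++i)
        ans += g(data[i]);   // opaque call: cannot be inlined or analyzed
    return ans;
}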

A careful reading of the documentation for __attribute__((optimize)) makes it look like each of the functions main() and std::min() will be compiled as if with -O3. But that's not the same as compiling the two of them together with -O3, as only in the latter case would interprocedural optimizations like inlining be available.

Here is a very simple example on godbolt. With #pragma GCC optimize ("O3") the functions foo() and please_inline_me() are each optimized, but please_inline_me() does not get inlined. But with -O3 on the command line, it does.
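
The test case there is roughly of this shape (a reconstruction from the names mentioned above, not the exact code at the link):

#pragma GCC optimize ("O3")

// Hypothetical reconstruction: under the pragma each function is optimized
// on its own, but as described above please_inline_me() does not get
// inlined into foo(); with -O3 on the command line it does.
int please_inline_me(int x) {
    return x * 3 + 1;
}

int foo(int x) {
    return please_inline_me(x) + please_inline_me(x + 1);
}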

A guess would be that the optimize attribute, and by extension #pragma GCC optimize, causes the compiler to treat the function as if its definition were in a separate source file which was being compiled with the specified option. And indeed, if std::min() and main() were defined in separate source files, you could compile each one with -O3 but you wouldn't get inlining.

Arguably the GCC manual should document this more explicitly, though I guess if it's only meant for debugging, it might be fair to assume it's intended for experts who would be familiar with the distinction.

If you really do compile your example with -O3 on the command line, you get identical (vectorized) assembly for both versions, or at least I did. (After fixing the backwards comparison: your ternary code is computing max instead of min.)
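
For completeness, here is one way the corrected example might look (a sketch, not taken verbatim from the question): the pragmas are dropped, the comparison is flipped so both forms really compute the minimum, and optimization is requested on the command line, e.g. g++ -O3 -mavx2 -o test test.cpp.

#include <cstdio>
#include <cstdlib>
#include <algorithm>

#define N (1<<20)
char a[N], b[N];

int main() {
    for (int i = 0; i < N; ++i) {
        a[i] = rand() % 100;
        b[i] = rand() % 100;
    }

    int ans = 0;
    for (int i = 0; i < N; ++i) {
        // Both lines compute the minimum; built with -O3 on the command
        // line, they produce the same vectorized loop.
        ans += std::min(a[i], b[i]);
        // ans += a[i] < b[i] ? a[i] : b[i];
    }
    printf("%d\n", ans);
}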

Nate Eldredge
  • Probably `-O0` (explicit or default) is "special". Even if you manually enable all the optimization options that `-O2` includes (reported in asm comments with `-fverbose-asm`), you still have some debug-mode behaviours, IIRC, like not inlining or even still syncing vars to memory between statements. Maybe setting O3 via a pragma doesn't fully get it out of debug-mode. – Peter Cordes May 21 '21 at 17:02
  • GCC's manual [also says](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-optimize-function-attribute) *The optimize **attribute** should be used for debugging purposes only. It is not suitable in production code.* - I assume that applies to the pragma as well. – Peter Cordes May 21 '21 at 17:04
  • @PeterCordes: Good spot in the manual. I added this to the answer. – Nate Eldredge May 21 '21 at 17:57