The loop in your example is vectorised by GCC versions before 7.1 and not vectorised from GCC 7.1 onwards, so the behaviour changed between those releases.
We can look at the compiler's optimisation report by adding -fopt-info-vec-all to the command line:
For GCC 7.3:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: not vectorized: complicated access pattern.
<source>:24:29: note: bad data access.
<source>:21:5: note: vectorized 0 loops in function.
For GCC 6.3:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: === vect_mark_stmts_to_be_vectorized ===
[...]
<source>:24:29: note: LOOP VECTORIZED
<source>:21:5: note: vectorized 1 loops in function.
So GCC 7.x decides not to vectorise the loop because of a "complicated access pattern", which might be the size() function, not yet inlined at that point. Forcing inlining, or inlining it manually, fixes that. GCC 6.x seems to do this by itself. However, the assembly looks as if size() was eventually inlined in both cases, but perhaps only after the vectorisation step in GCC 7.x (this is me guessing).
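A minimal sketch of what "doing it manually" could look like. The original code is not reproduced here, so the names Vec, values, count, capacity and fill are assumptions made for illustration only. The idea is to read size() once into a local, so the loop bound is a plain value rather than a call the vectoriser has to see through:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the container in the question; Vec, values,
// count and capacity are assumed names, not the original code.
constexpr std::size_t capacity = 1024;

struct Vec {
    int values[capacity];
    std::size_t count;
    std::size_t size() const { return count; }
};

// Manual "inlining": size() is read once before the loop, so the loop
// bound is a local that the stores into v.values cannot modify.
inline void fill(Vec& v, int x) {
    const std::size_t n = v.size();
    assert(n <= capacity);  // guard against writing past the array
    for (std::size_t i = 0; i < n; ++i) {
        v.values[i] = x + static_cast<int>(i);
    }
}
```

The other option, forcing inlining, could be tried with GCC's __attribute__((always_inline)) on size(). Whether either variant restores vectorisation for a given GCC 7.x build is best checked against the -fopt-info-vec-all report.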
I wondered why you put the asm volatile(...) line at the end. It is probably there to prevent the compiler from throwing away the whole loop, because the loop has no observable effect in this test case. If we just return the last element of v instead, we achieve the same thing without any possible side effects on the memory model for v.
return v.values[capacity - 1];
The code now vectorises with GCC 7.x, as it already did with GCC 6.x:
<source>:24:29: note: === vect_pattern_recog ===
<source>:24:29: note: === vect_analyze_data_ref_accesses ===
<source>:24:29: note: === vect_mark_stmts_to_be_vectorized ===
[...]
<source>:24:29: note: LOOP VECTORIZED
<source>:21:5: note: vectorized 1 loops in function.
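For context, here is a rough sketch of the return-value approach in a complete function. The struct layout, the function name run and the seed parameter are assumptions for illustration; only the final return statement comes from the change described above:

```cpp
#include <cstddef>

// Hypothetical reconstruction; Vec, capacity, run and seed are assumed
// names and not taken from the original test case.
constexpr std::size_t capacity = 1024;

struct Vec {
    int values[capacity];
};

// The result of the loop is returned to the caller, which keeps it
// observable without an asm volatile barrier on v.
int run(int seed) {
    Vec v;
    for (std::size_t i = 0; i < capacity; ++i) {
        v.values[i] = seed + static_cast<int>(i);
    }
    return v.values[capacity - 1];
}
```

In a test this small an optimiser is free to simplify the loop in other ways, so the vectorisation report remains the thing to check.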
So what's the conclusion here?
- something changed with GCC 7.1
- best guess: a side effect of the asm volatile messes with the inlining of size(), preventing vectorisation
Whether or not this is a bug (it could be in either 6.x or 7.x, depending on what behaviour is desired for the asm volatile() construct) would be a question for the GCC developers.
Also: try adding -mavx2 or -mavx512f -mavx512cd (or -march=native etc.) to the command line, depending on your hardware, to get vectorisation beyond the 128-bit xmm registers, i.e. with ymm and zmm registers.
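One way to confirm which of these flags actually took effect is to check the macros GCC predefines for them. The helper name enabled_simd below is made up for this sketch; the macros __AVX512F__, __AVX2__ and __SSE2__ are the ones GCC defines when the corresponding instruction sets are enabled:

```cpp
#include <string>

// Reports the widest x86 vector extension enabled at compile time.
// enabled_simd is a hypothetical helper name used only in this sketch.
inline std::string enabled_simd() {
#if defined(__AVX512F__)
    return "AVX-512F (512-bit zmm)";
#elif defined(__AVX2__)
    return "AVX2 (256-bit ymm)";
#elif defined(__SSE2__)
    return "SSE2 (128-bit xmm)";
#else
    return "no x86 SIMD extension detected";
#endif
}
```

This only shows what the compiler is allowed to emit. Whether the loop really uses the wider registers still has to be read from the assembly or the vectorisation report.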