False autovectorization in Intel C compiler (icc)

Question

I need to vectorize some huge loops in a program with SSE. To save time, I decided to let ICC deal with it. For that purpose, I prepare the data properly, taking alignment into account, and I use the compiler directives #pragma simd, #pragma aligned, and #pragma ivdep. When compiling with the various -vec-report options, the compiler tells me that the loops were vectorized. A quick look at the generated assembly seems to confirm that, since it contains plenty of vector instructions that work on packed single-precision operands (all operations in the serial code handle float operands).
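For reference, here is a minimal sketch of the kind of annotated loop described above. The function name `scale_add`, the array names, and the 16-byte alignment are assumptions for illustration, not the asker's actual code. The question names `#pragma aligned`; the sketch assumes the ICC spelling `#pragma vector aligned` was meant. The ICC-specific pragmas are guarded so that other compilers simply compile the plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i] on 16-byte-aligned float arrays.
   The pragmas are hints for ICC only; they do not change the result. */
void scale_add(float a, const float *x, float *y, size_t n)
{
    /* The alignment promise made to the compiler must actually hold. */
    assert((uintptr_t)x % 16 == 0 && (uintptr_t)y % 16 == 0);

#ifdef __INTEL_COMPILER
#pragma simd            /* ask ICC to vectorize this loop */
#pragma vector aligned  /* assumed meaning of "#pragma aligned" in the question */
#endif
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

A vectorization report for a loop like this only says that a vector version was generated; it does not say which version runs, or how much of the surrounding code stays scalar.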

The problem is that when I read hardware counters with PAPI, the number of FP operations I get (PAPI_FP_INS and PAPI_FP_OPS) is pretty much the same in the auto-vectorized code and the original one, when one would expect it to be significantly lower in the auto-vectorized code. What's more, I vectorized by hand a simplified version of the problem in question, and in that case I do get about 3 times fewer FP operations.
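To make the hand-vectorized comparison concrete, the sketch below processes four floats per SSE instruction and tallies how many FP arithmetic instructions the source asks for. The kernel, the name `scale_add_sse`, and the tally are hypothetical; the tally is only a rough stand-in for what an instruction-based counter might report, and the asker's real code is not shown in the thread. Both arrays are assumed to be 16-byte aligned.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical hand-vectorized y[i] = a * x[i] + y[i].
   x and y are assumed 16-byte aligned.
   Returns the number of FP arithmetic instructions requested at source
   level (one multiply plus one add per loop step, packed or scalar). */
size_t scale_add_sse(float a, const float *x, float *y, size_t n)
{
    size_t fp_ins = 0;
    size_t i = 0;
#if defined(__SSE__)
    const __m128 va = _mm_set1_ps(a);
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);          /* aligned packed load */
        __m128 vy = _mm_load_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy); /* 2 packed FP instructions */
        _mm_store_ps(y + i, vy);
        fp_ins += 2;
    }
#endif
    /* Scalar remainder (or the whole loop when SSE is unavailable). */
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
        fp_ins += 2;
    }
    return fp_ins;
}
```

With `n = 10` on an SSE target, the packed loop runs twice and the scalar remainder twice, so the tally is 8 instead of the 20 that a fully scalar loop would need. The scalar remainder is one reason the reduction falls short of 4x.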

Has anyone experienced something similar?

You might wanna double-check that PAPI is counting each SSE instruction as 2 or 4 FLOPs instead of 1. — Mysticial, Sep 06 '12 at 14:43
As I mentioned, I vectorized a simplified case by hand (basically the same problem but with far fewer operations) and I get 3 times fewer FP operations from the PAPI counters. In fact, I left some pieces of code scalar, so I can't expect 1/4 of the FP operations, but 3 times fewer sounds reasonable enough, so I presume PAPI counts *decoded* instructions (and a vector instruction is just one instruction). — Genís, Sep 06 '12 at 14:53
I vectorized the code myself and the number of FP instructions dropped by a factor of 2. That reduction is to be expected, since some operations were left in scalar form. So I assume that ICC didn't really vectorize. One possible explanation is the huge number of operands, which leads to a lot of register spilling and reduces the speedup vectorization can give, though the compiler never complains about it. — Genís, Sep 13 '12 at 09:55
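The two reductions quoted in these comments can be checked against a simple model. It assumes the counter records one event per instruction regardless of vector width, which is the presumption stated in the comment above and is not confirmed anywhere in this thread. The function name `fp_ins_reduction` is hypothetical.

```c
/* Expected ratio (scalar instruction count) / (vectorized instruction count)
   when a fraction `vector_fraction` of the FP operations is packed `width`
   wide and the rest stays scalar. Assumes one counted event per instruction. */
double fp_ins_reduction(double vector_fraction, int width)
{
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / width);
}
```

Under this model, with 4-wide SSE a 3x reduction corresponds to about 8/9 of the FP operations being packed, and a 2x reduction to about 2/3. A reduction near 1x would mean that almost none of the executed FP operations were packed.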

score 0 · Answer 1 · answered May 25 '15 at 13:58

Spills may destroy the advantage of vectorization, so 64-bit mode (which has 16 XMM registers instead of 8) may gain significantly over 32-bit mode. Also, icc may version a loop, and you may be hitting a scalar version even though a vector version is present. icc versions released in the last year or two have fixed some problems in this area.
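As a rough illustration of loop versioning, the sketch below writes the run-time dispatch out by hand. The specific checks (16-byte alignment and a minimum trip count of 8) and the name `scale_add_versioned` are assumptions chosen for the example; the conditions icc actually generates are not described in this thread. The "vector" branch is plain C standing in for packed code.

```c
#include <stddef.h>
#include <stdint.h>

enum path { PATH_SCALAR = 0, PATH_VECTOR = 1 };

/* Hypothetical hand-written equivalent of a versioned loop:
   both versions compute y[i] = a * x[i] + y[i]; a run-time test picks one.
   Returns which version was executed. */
enum path scale_add_versioned(float a, const float *x, float *y, size_t n)
{
    int aligned = ((uintptr_t)x % 16 == 0) && ((uintptr_t)y % 16 == 0);
    int long_enough = n >= 8; /* assumed threshold, for illustration only */

    if (aligned && long_enough) {
        size_t i = 0;
        /* Stand-in for the packed version: blocks of 4 elements. */
        for (; i + 4 <= n; i += 4)
            for (size_t k = 0; k < 4; ++k)
                y[i + k] = a * x[i + k] + y[i + k];
        for (; i < n; ++i) /* remainder */
            y[i] = a * x[i] + y[i];
        return PATH_VECTOR;
    }

    /* Scalar version: taken whenever the run-time test fails. */
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    return PATH_SCALAR;
}
```

In a setup like this, the vectorization report and the assembly would both show packed instructions, yet a call with misaligned pointers or a short trip count would execute only the scalar branch, which would leave an instruction counter unchanged.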
