I need to vectorize with SSE a some huge loops in a program. In order to save time I decided to let ICC deal with it. For that purpose, I prepare properly the data, taking into account the alignment and I make use of the compiler directives #pragma simd
, #pragma aligned
, #pragma ivdep
. When compiling with the several -vec-report
options, compiler tells me that loops were vectorized. A quick look to the assembly generated by the compiler seems to confirm that, since you can find there plenty of vectorial instructions that works with packed single precision operands (all operations in the serial code handler float operands).
The problem is that when I take hardware counters with PAPI the number of FP operations I get (PAPI_FP_INS
and PAPI_FP_OPS
) is pretty the same in the auto-vectorized code and the original one, when one would expect to be significantly less in the auto-vectorized code. What's more, a vectorized by-hand a simplified problem of the one that concerns and in this case I do get something like 3 times less of FP operations.
Has anyone experienced something similar with this?