I want to vectorize the Fortran loop below with SIMD directives:

!DIR$ SIMD
    DO IELEM = 1 , NELEM
      X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
    ENDDO

I am targeting the AVX2 instruction set. The program is compiled with:

ifort main_vec.f -simd -g -pg -O2 -vec-report6 -o vec.out -xcore-avx2 -align array32byte

Then I'd like to add a VECTORLENGTH(n) clause after SIMD. If there is no such clause, or if n = 2 or 4, the vectorization report says nothing about an unroll factor.

If n = 8 or 16, the report says: vectorization support: unroll factor set to 2.

I've read Intel's article about "vectorization support: unroll factor set to xxxx", so I guess the loop is unrolled to something like:

    DO IELEM = 1 , NELEM, 2
      X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
      X(IKLE(IELEM+1)) = X(IKLE(IELEM+1)) + W(IELEM+1)
    ENDDO

Then 2 elements of X go into one vector register, 2 elements of W go into another, and the addition is done. But how does the value of VECTORLENGTH work? Or maybe I don't really understand what the vector length means.

And since I use AVX2 instructions, for the DOUBLE PRECISION array X, what is the maximum vector length that can be reached?

Here's part of the assembly of the loop with SSE2 and VL=8, where the compiler told me the unroll factor is 2. However, it used 4 registers instead of 2.

.loc    1  114  is_stmt 1
        movslq    main_vec_$IKLE.0.1(,%rdx,4), %rsi             #114.9
..LN202:
        movslq    4+main_vec_$IKLE.0.1(,%rdx,4), %rdi           #114.9
..LN203:
        movslq    8+main_vec_$IKLE.0.1(,%rdx,4), %r8            #114.9
..LN204:
        movslq    12+main_vec_$IKLE.0.1(,%rdx,4), %r9           #114.9
..LN205:
        movsd     -8+main_vec_$X.0.1(,%rsi,8), %xmm0            #114.26
..LN206:
        movslq    16+main_vec_$IKLE.0.1(,%rdx,4), %r10          #114.9
..LN207:
        movhpd    -8+main_vec_$X.0.1(,%rdi,8), %xmm0            #114.26
..LN208:
        movslq    20+main_vec_$IKLE.0.1(,%rdx,4), %r11          #114.9
..LN209:
        movsd     -8+main_vec_$X.0.1(,%r8,8), %xmm1             #114.26
..LN210:
        movslq    24+main_vec_$IKLE.0.1(,%rdx,4), %r14          #114.9
..LN211:
        addpd     main_vec_$W.0.1(,%rdx,8), %xmm0               #114.9
..LN212:
        movhpd    -8+main_vec_$X.0.1(,%r9,8), %xmm1             #114.26
..LN213:
..LN214:
        movslq    28+main_vec_$IKLE.0.1(,%rdx,4), %r15          #114.9
..LN215:
        movsd     -8+main_vec_$X.0.1(,%r10,8), %xmm2            #114.26
..LN216:
        addpd     16+main_vec_$W.0.1(,%rdx,8), %xmm1            #114.9
..LN217:
        movhpd    -8+main_vec_$X.0.1(,%r11,8), %xmm2            #114.26
..LN218:
..LN219:
        movsd     -8+main_vec_$X.0.1(,%r14,8), %xmm3            #114.26
..LN220:
        addpd     32+main_vec_$W.0.1(,%rdx,8), %xmm2            #114.9
..LN221:
        movhpd    -8+main_vec_$X.0.1(,%r15,8), %xmm3            #114.26
..LN222:
..LN223:
        addpd     48+main_vec_$W.0.1(,%rdx,8), %xmm3            #114.9
..LN224:
        movsd     %xmm0, -8+main_vec_$X.0.1(,%rsi,8)            #114.9
..LN225:
   .loc    1  113  is_stmt 1
        addq      $8, %rdx                                      #113.7
..LN226:
   .loc    1  114  is_stmt 1
        psrldq    $8, %xmm0                                     #114.9
..LN227:
   .loc    1  113  is_stmt 1
        cmpq      $26000, %rdx                                  #113.7
..LN228:
   .loc    1  114  is_stmt 1
        movsd     %xmm0, -8+main_vec_$X.0.1(,%rdi,8)            #114.9
..LN229:
        movsd     %xmm1, -8+main_vec_$X.0.1(,%r8,8)             #114.9
..LN230:
        psrldq    $8, %xmm1                                     #114.9
..LN231:
        movsd     %xmm1, -8+main_vec_$X.0.1(,%r9,8)             #114.9
..LN232:
        movsd     %xmm2, -8+main_vec_$X.0.1(,%r10,8)            #114.9
..LN233:
        psrldq    $8, %xmm2                                     #114.9
..LN234:
        movsd     %xmm2, -8+main_vec_$X.0.1(,%r11,8)            #114.9
..LN235:
        movsd     %xmm3, -8+main_vec_$X.0.1(,%r14,8)            #114.9
..LN236:
        psrldq    $8, %xmm3                                     #114.9
..LN237:
        movsd     %xmm3, -8+main_vec_$X.0.1(,%r15,8)            #114.9
..LN238:
Shiyu

1 Answer


1) Vector length N is the number of elements/iterations you can execute in parallel after "vectorizing" your loop (normally by putting N elements of array X into a single vector register and processing them all together with one vector instruction). For simplicity, think of the vector length as the value given by this formula:

Vector Length (abbreviated VL) = Vector Register Width / Sizeof (data type)

For AVX2, Vector Register Width = 256 bits. Sizeof(double precision) = 8 bytes = 64 bits. Thus:

Vector Length (double FP, avx2) = 256 / 64 = 4

!DIR$ SIMD VECTORLENGTH(N) basically forces the compiler to use the specified vector length (and to put N elements of array X into a single vector register). That's it.
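
For instance, a minimal sketch of requesting the full AVX2 vector length for double precision on your loop (same X, IKLE and W arrays as in your question) might look like this:

! Hypothetical sketch: with AVX2, VL = 256/64 = 4 doubles per register,
! so VECTORLENGTH(4) asks for exactly the hardware vector length.
!DIR$ SIMD VECTORLENGTH(4)
    DO IELEM = 1 , NELEM
      X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
    ENDDO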

2) Unrolling and vectorization relationship. For simplicity, think of unrolling and vectorization as normally unrelated (somewhat "orthogonal") optimization techniques.

If your loop is unrolled by a factor of M (M could be 2, 4, ...), then it doesn't necessarily mean that vector registers were used at all, and it does not mean that your loop was parallelized in any sense. What it means instead is that M instances of the original loop iterations have been grouped together into a single iteration; within a given new "unwound"/"unrolled" iteration, the old iterations are executed sequentially, one by one (so your guessed example is absolutely correct).
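
As an illustration (my sketch, not compiler output), a purely scalar unroll by M = 4 of your loop could be written like this, assuming NELEM is a multiple of 4 (otherwise a remainder loop is needed):

! Sketch of a scalar unroll by 4: no vector registers implied,
! the four former iterations are simply executed one after another.
    DO IELEM = 1 , NELEM, 4
      X(IKLE(IELEM  )) = X(IKLE(IELEM  )) + W(IELEM  )
      X(IKLE(IELEM+1)) = X(IKLE(IELEM+1)) + W(IELEM+1)
      X(IKLE(IELEM+2)) = X(IKLE(IELEM+2)) + W(IELEM+2)
      X(IKLE(IELEM+3)) = X(IKLE(IELEM+3)) + W(IELEM+3)
    ENDDO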

The purpose of unrolling is normally to make the loop more "micro-architecture/memory-friendly". In more detail: by making loop iterations "fatter" you normally improve the balance between the pressure on your CPU resources and the pressure on your memory/cache resources, especially since after unrolling you can normally reuse some data in registers more effectively.

3) Unrolling + vectorization. It is not uncommon for compilers to simultaneously vectorize (with VL = N) and unroll (by M) certain loops. As a result, the number of iterations in the optimized loop is smaller than the number of iterations in the original loop by approximately a factor of N*M; however, the number of elements processed in parallel (simultaneously at a given moment in time) is still only N. Thus, in your example, if the loop is vectorized with VL = 4 and unrolled by 2, then the pseudo-code for it might look like:

DO IELEM = 1 , NELEM, 8
  [X(IKLE(IELEM)),X(IKLE(IELEM+2)), X(IKLE(IELEM+4)), X(IKLE(IELEM+6))] = ...
  [X(IKLE(IELEM+1)),X(IKLE(IELEM+3)), X(IKLE(IELEM+5)), X(IKLE(IELEM+7))] = ...
ENDDO

, where the square brackets "correspond" to vector register contents.

4) Vectorization vs. unrolling:

  • For loops with a relatively small number of iterations (especially in C++), it may happen that unrolling is not desirable, since it partially blocks efficient vectorization (not enough iterations to execute in parallel) and (as you see from my artificial example) may somehow impact the way the data has to be loaded from memory. Different compilers have different heuristics for balancing trip counts, VL and unrolling against each other; that's probably why unrolling was disabled in your case when VL was smaller than 8.
  • Run-time and compile-time trade-offs between trip counts, unrolling and vector length, as well as appropriate automatic suggestions (especially when using a fresh Intel C++ or Fortran Compiler), could be explored using the Intel (Vectorization) Advisor.

5) P.S. There is a third dimension (I don't really like to talk about it).

When the vectorlength requested by the user is bigger than the possible vector length on the given hardware (say, specifying vectorlength(16) for an AVX2 platform with double FP), or when you mix different types, then the compiler can (or cannot) start using a notion of a "virtual vector register" and start doing double-/quad-pumping. M-pumping is a kind of unrolling, but only for a single instruction (i.e. pumping leads to repeating the single instruction, while unrolling leads to repeating the whole loop body). You may try to read about m-pumping in recent OpenMP books. So in some cases you may end up with a superposition of a) vectorization, b) unrolling and c) double-pumping, but it's not a common case and I'd avoid enforcing vectorlength > 2*ISA_VectorLength.
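
Purely as an illustration of that theoretical case (my sketch, not something I'd recommend), requesting VL=16 for doubles on AVX2 could look like this, and the compiler may then repeat each 4-wide vector instruction several times to cover the requested lanes:

! Hypothetical: VL=16 requested for doubles on AVX2 (hardware VL = 4).
! The compiler may "quad-pump", i.e. emit each 4-wide vector
! instruction four times, to cover the requested 16 lanes.
!DIR$ SIMD VECTORLENGTH(16)
    DO IELEM = 1 , NELEM
      X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
    ENDDO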

zam
  • I talked to the authors of the Intel article directly and they confirmed they have to fix it – zam Aug 19 '15 at 09:42
  • (they didn't say they have to fix the compiler; the problem was with the diagnostics and the article only; the compiler works as I explained; if you disagree, make a test case and look at the assembly and you will find confirmation) – zam Aug 19 '15 at 09:43
  • OK, I am not an (x86) assembly expert (I prefer looking at GIMPLE from gfortran), but there were indeed 4 VADDPD instructions in a row. +1 – Vladimir F Героям слава Aug 19 '15 at 10:02
  • Thanks for the explanation, it's very clear. But I still wonder: for double precision variables, if VL > 4, we can't fit them into a single register even with AVX, so how does it work? – Shiyu Aug 19 '15 at 13:16
  • OK, I added bullet (5) to my answer just for the sake of addressing your theoretical question. Honestly I didn't want to touch multi-pumping, because normally you don't see it very often, and because specifying a super-big vectorlength is more of a theoretical exercise with little practical applicability (although for some computations with a mixture of short, int, float and double, you may end up doing it anyway) – zam Aug 19 '15 at 18:01
  • Thanks for your time and help, zam. I added part of the assembly of the loop, and I don't know if it is the multi-pumping you described (maybe we can't see it in the assembly here). From the assembly file, I think the instructions for VL=8 are roughly those for VL=2 repeated 4 times. – Shiyu Aug 20 '15 at 08:16
  • I forgot to comment on the code sample itself. The code you provided is poorly vectorizable, because it has a lot of indirect referencing (i.e. each next accessed element of X is located in a place that is "unpredictable" from the compiler's perspective). Before Haswell (AVX2) there was no hardware support for moving such data into a vector register "at once", so the corresponding code generation was pretty much "serialized" (using shuffling/permutation instructions). Starting from AVX2 there is a "vgather" instruction in hardware, which makes it possible to load such values all together... – zam Aug 20 '15 at 16:13
  • So, pretty often the memory loading/storing will be serialized, because you deal with indirect access to memory. This could be the real reason; it's not that the loop was unrolled, double-pumped or something else; instead it just has serialized memory "gathering", loading elements one by one due to the irregular "stride". I don't know why vgather was not enabled in your case; I don't have a full reproducer and I'm not sure about the compiler version you used. If you want to experiment with unroll/VL, consider a code sample with a more regular or unit stride, to avoid exploring too many different problems at once. – zam Aug 20 '15 at 16:19
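
For reference, a minimal unit-stride variant of the loop along the lines zam suggests in the last comment (a sketch only; the indirect IKLE() indexing is removed):

! Sketch of a unit-stride variant for unroll/VL experiments:
! no indirect IKLE() access, so loads/stores need no gather/scatter.
!DIR$ SIMD
    DO IELEM = 1 , NELEM
      X(IELEM) = X(IELEM) + W(IELEM)
    ENDDO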