
I was wondering if unrolling this loop:

for (int i = 0; i < n; i++) {
    *p = a[i]*b[i];
    p++;
}

into

for (int i = 0; i < n; i+=4) {
    *(p + 0) = a[i + 0]*b[i + 0];
    *(p + 1) = a[i + 1]*b[i + 1];
    *(p + 2) = a[i + 2]*b[i + 2];
    *(p + 3) = a[i + 3]*b[i + 3];
    p+=4;
}

would help the compiler in terms of auto-vectorization.


I can imagine that it will vectorize the first loop anyway. But does being explicit help?

Armen Avetisyan
    Ask [godbolt](https://godbolt.org). – nwp Mar 22 '17 at 11:15
    I tried - unrolling seems to harm rather than help. https://godbolt.org/g/P9fbpH – peterchen Mar 22 '17 at 11:21
    BTW, why not write `p[i] = a[i] * b[i];` ? – Jarod42 Mar 22 '17 at 11:24
  • The only way to know for sure is to compile and see. Doing the unrolling yourself could hurt you rather than help you. Write clear, clean, maintainable code first. Then after you do that compile with optimizations and profile to see if you even need to work on it. – NathanOliver Mar 22 '17 at 11:40
    When you do this, it is good practice to rearrange so read everything first, then calculate, then write back (if you do so here, GCC *will* autovectorize). Mixing reading and writing scares the compiler because of possible aliasing. You can also try "restrict" but it's a bit odd and not officially in C++. – harold Mar 22 '17 at 13:22
  • hi @harold, can you tell a lil bit more? About provoking auto-vectorization – Armen Avetisyan Mar 22 '17 at 18:59
  • @ArmenAvetisyan "provoke" may go a bit far, it's mainly "the absence of excuses not to auto-vectorize". It will still be fickle, and not everything that is auto-vectorized is auto-vectorized efficiently. If you definitely want vectorization, you can make sure you get it by using SIMD intrinsics. – harold Mar 22 '17 at 19:05

2 Answers


For successful auto-vectorization, your compiler needs to be able to prove that the involved pointers do not alias, i.e. it needs certainty that a, b, and p never overlap, e.g.:

void somefunction()
{
    int a[12] = { ... };
    int b[12] = { ... };
    int p[12];

    /* Compiler knows: a, b and p do not overlap */
}

void multiply(int n, int* p, int* a, int* b)
{
    /* Compiler unsure: a, b and p could overlap, e.g.:
         multiply(8, array1, array1, array1);
       or worse:
         multiply(8, array1 + 1, array1, array1 + 2);
    */
}

If they do overlap, the first iteration could influence the next one and therefore they cannot be performed in parallel.

For a function, you can actually promise the compiler that the arguments will not overlap by using the restrict keyword. Unfortunately, restrict is only officially part of the C standard, not (yet) C++. However, many C++ compilers support a similar keyword, e.g. __restrict__ for gcc and clang, and __restrict for MSVC. For example, for gcc:

void multiply(int n, int* __restrict__ p, int* __restrict__ a, int* __restrict__ b)
{
    for (int i = 0; i < n; i++) {
       p[i] = a[i]*b[i];
    }
}

The resulting code (using gcc -O2 -ftree-vectorize) seems pretty decent:

multiply(int, int*, int*, int*):
        test    edi, edi
        jle     .L1
        lea     r8d, [rdi-4]
        lea     r9d, [rdi-1]
        shr     r8d, 2
        add     r8d, 1
        cmp     r9d, 2
        lea     eax, [0+r8*4]
        jbe     .L9
        xor     r9d, r9d
        xor     r10d, r10d
.L5:
        movdqu  xmm0, XMMWORD PTR [rdx+r9]
        add     r10d, 1
        movdqu  xmm2, XMMWORD PTR [rcx+r9]
        movdqa  xmm1, xmm0
        psrlq   xmm0, 32
        pmuludq xmm1, xmm2
        psrlq   xmm2, 32
        pshufd  xmm1, xmm1, 8
        pmuludq xmm0, xmm2
        pshufd  xmm0, xmm0, 8
        punpckldq       xmm1, xmm0
        movups  XMMWORD PTR [rsi+r9], xmm1
        add     r9, 16
        cmp     r10d, r8d
        jb      .L5
        cmp     eax, edi
        je      .L12
.L3:
        cdqe
.L7:
        mov     r8d, DWORD PTR [rdx+rax*4]
        imul    r8d, DWORD PTR [rcx+rax*4]
        mov     DWORD PTR [rsi+rax*4], r8d
        add     rax, 1
        cmp     edi, eax
        jg      .L7
        rep ret
.L1:
        rep ret
.L12:
        rep ret
.L9:
        xor     eax, eax
        jmp     .L3

Update: Without the restrict keyword, gcc apparently generates code that checks for aliasing at run time and provides both a vectorized and a scalar path.

By the way, your unrolled version does not account for the situation where n is not a multiple of 4: it reads and writes past the intended range, so it is functionally different!

Elijan9

In general, you can help yourself and the optimiser by implementing the algorithm in terms of standard algorithms.

For example:

#include <boost/iterator/zip_iterator.hpp>

void bar(int n, int * p, const int * a, const int * b)
{
    auto source_begin = boost::make_zip_iterator(boost::make_tuple(a, b));
    auto source_end = boost::make_zip_iterator(boost::make_tuple(a + n, b + n));

    std::transform(source_begin, source_end, p, [](auto&& source) {
        return boost::get<0>(source) * boost::get<1>(source);
    });
}

Which clang 3.9.1 turns into:

bar(int, int*, int const*, int const*):                        # @bar(int, int*, int const*, int const*)

... alignment stuff ...
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi]
        vmovdqu ymmword ptr [rsi + 4*rdi], ymm0
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi + 32]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi + 32]
        vmovdqu ymmword ptr [rsi + 4*rdi + 32], ymm0
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi + 64]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi + 64]
        vmovdqu ymmword ptr [rsi + 4*rdi + 64], ymm0
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi + 96]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi + 96]
        vmovdqu ymmword ptr [rsi + 4*rdi + 96], ymm0
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi + 128]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi + 128]
        vmovdqu ymmword ptr [rsi + 4*rdi + 128], ymm0
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi + 160]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi + 160]
        vmovdqu ymmword ptr [rsi + 4*rdi + 160], ymm0
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi + 192]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi + 192]
        vmovdqu ymmword ptr [rsi + 4*rdi + 192], ymm0
        vmovdqu ymm0, ymmword ptr [rcx + 4*rdi + 224]
        vpmulld ymm0, ymm0, ymmword ptr [rdx + 4*rdi + 224]
        vmovdqu ymmword ptr [rsi + 4*rdi + 224], ymm0
        add     rdi, 64
        add     rbx, 8
        jne     .LBB0_7
.LBB0_8:
        test    r14, r14
        je      .LBB0_11
        lea     rbx, [rdx + 4*rdi]
        lea     rax, [rcx + 4*rdi]
        lea     rdi, [rsi + 4*rdi]
        neg     r14
.LBB0_10:                               # =>This Inner Loop Header: Depth=1
        vmovdqu ymm0, ymmword ptr [rax]
        vpmulld ymm0, ymm0, ymmword ptr [rbx]
        vmovdqu ymmword ptr [rdi], ymm0
        add     rbx, 32
        add     rax, 32
        add     rdi, 32
        add     r14, 1
        jne     .LBB0_10
.LBB0_11:
        cmp     r8, r9
        je      .LBB0_16
        lea     rsi, [rsi + 4*r9]
        lea     rcx, [rcx + 4*r9]
        lea     rdx, [rdx + 4*r9]
.LBB0_13:
        add     rcx, 4
        add     rdx, 4
.LBB0_14:                               # =>This Inner Loop Header: Depth=1
        mov     rax, rdx
        mov     edx, dword ptr [rcx - 4]
        imul    edx, dword ptr [rax - 4]
        mov     dword ptr [rsi], edx
        add     rsi, 4
        lea     rdx, [rax + 4]
        cmp     r11, rcx
        lea     rcx, [rcx + 4]
        jne     .LBB0_14
        cmp     r10, rax
        jne     .LBB0_14
.LBB0_16:
        pop     rbx
        pop     r14
        vzeroupper
        ret

Ignoring the alignment check, I think you'll agree that the compiler did a pretty good job.

However, gcc seems to miss this opportunity. Possible defect?

Richard Hodges