Why does the compiler not use SIMD in my range-expression?

Question

I have two implementations of a dot product. One is hand-coded (https://godbolt.org/z/48EEnnY4r):

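#include <cstddef>
#include <vector>

int bla2(const std::vector<int>& a, const std::vector<int>& b){
    int res = 0;
    // reads b[i] for every i < a.size(), so b must be at least as long as a
    for(std::size_t i=0; i < a.size(); ++i){
        res += a[i]*b[i];
    }
    return res;
}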

The other uses C++23's std::views::zip (https://godbolt.org/z/TsGW1WYnf):

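#include <ranges>
#include <vector>

int bla(const std::vector<int>& a, const std::vector<int>& b){
    int res = 0;
    // std::views::zip (C++23) stops at the end of the shorter of a and b
    for(const auto& [x,y] : std::views::zip(a,b)){
        res += x*y;
    }
    return res;
}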

On Godbolt, the hand-coded version uses a lot of SIMD instructions, while the zip-based implementation doesn't. What's going on here? If I implement it using iterators, it also gets SIMD. I thought ranges were just iterators under the hood. Are these expressions not equivalent?
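The question mentions an iterator-based version that does get SIMD but doesn't show it. Below is a plausible reconstruction, not the asker's code: the name `bla_iter` is hypothetical, and it makes the same assumption `bla2` makes implicitly (that `b` is at least as long as `a`). Like `bla2`, it compares only one iterator against an end, so its trip count is fixed before the first iteration, which may be why it behaves like the hand-coded loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical reconstruction of an iterator-based dot product.
int bla_iter(const std::vector<int>& a, const std::vector<int>& b) {
    // Assumed precondition, same as bla2: b has at least as many elements as a.
    assert(b.size() >= a.size());
    int res = 0;
    auto itb = b.begin();
    // Only `ita` is tested against an end iterator, so the number of
    // iterations is a.end() - a.begin(), known before the loop starts.
    for (auto ita = a.begin(); ita != a.end(); ++ita, ++itb) {
        res += *ita * *itb;
    }
    return res;
}
```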

@PepijnKramer Yeah - but that's the "hand-coded" version. No such instructions in the second link. — Adrian Mole, Aug 24 '23 at 16:11
The hand-coded version fails to vectorize if you replace `i < a.size()` with `i < a.size() && i < b.size()`, but works with `i < a.size() & i < b.size()`: https://godbolt.org/z/zecvdG8qf (I'm not sure exactly how `views::zip(...).end()` is defined) — chtz, Aug 24 '23 at 17:04
Even with `-march=x86-64-v3` we don't get vectorization of the 2nd version. (Where vectorization would be more profitable, given 256-bit vectors and the SSE4.1 / AVX2 32x32 => 32-bit SIMD multiply instruction. `pmuludq` is a widening 32x32 => 64-bit multiply, so with no -march option GCC has to shuffle to use that twice per input vector and combine the results.) — Peter Cordes, Aug 24 '23 at 17:06
clang with libc++ (instead of libstdc++ that it and GCC use by default) does vectorize: https://godbolt.org/z/TPxG35jvK . Also, GCC's `-fopt-info-vec-missed` reports that GCC couldn't vectorize the libstdc++ loop because "*number of iterations cannot be computed*", which sounds like what @chtz found. (GCC/Clang can only vectorize loops when the trip-count isn't data-dependent, e.g. not strlen or memchr, only loops where the trip-count can be computed before the first iteration. Maybe a branchy loop condition is enough to throw it off?) — Peter Cordes, Aug 24 '23 at 17:09
@PepijnKramer: You could delete your erroneous first comment to remove distractions for future readers. — Peter Cordes, Aug 24 '23 at 20:39
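The comments point at the loop condition rather than at ranges as such: GCC reports that the "number of iterations cannot be computed" for the libstdc++ zip loop. A minimal sketch of the loop shapes discussed, with hypothetical function names; the vectorization behaviour noted in the comments is as reported in the thread, not re-verified here. The third variant, which hoists `std::min(a.size(), b.size())` out of the loop so the bound is one loop-invariant value, is an assumed workaround that follows from the trip-count explanation, not something tested in the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names. All three compute the dot product over the first
// min(a.size(), b.size()) elements.

// Short-circuit `&&`: the second comparison only runs if the first is true.
// chtz reports that GCC does not vectorize this form.
int dot_logical_and(const std::vector<int>& a, const std::vector<int>& b) {
    int res = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        res += a[i] * b[i];
    }
    return res;
}

// Non-short-circuit `&`: both comparisons are always evaluated, so the
// condition needs no extra branch. chtz reports that GCC vectorizes this form.
int dot_bitwise_and(const std::vector<int>& a, const std::vector<int>& b) {
    int res = 0;
    for (std::size_t i = 0; (i < a.size()) & (i < b.size()); ++i) {
        res += a[i] * b[i];
    }
    return res;
}

// Assumed workaround (not from the thread): compute the trip count once,
// before the first iteration, and compare against that single bound.
int dot_hoisted(const std::vector<int>& a, const std::vector<int>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    int res = 0;
    for (std::size_t i = 0; i < n; ++i) {
        res += a[i] * b[i];
    }
    return res;
}
```

Pasting these into Compiler Explorer with GCC's `-fopt-info-vec-missed` is one way to see which form a particular compiler version actually vectorizes.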
