I have two implementations of a dot product. One is hand-coded: https://godbolt.org/z/48EEnnY4r

#include <cstddef>
#include <vector>

// Index-based loop; only a.size() bounds the iteration.
int bla2(const std::vector<int>& a, const std::vector<int>& b){
    int res = 0;
    for(std::size_t i=0; i < a.size(); ++i){
        res += a[i]*b[i];
    }
    return res;
}

The other uses C++23's std::views::zip: https://godbolt.org/z/TsGW1WYnf

#include <ranges>
#include <vector>

// zip stops at the end of the shorter range.
int bla(const std::vector<int>& a, const std::vector<int>& b){
    int res = 0;
    for(const auto& [x,y] : std::views::zip(a,b)){
        res += x*y;
    }
    return res;
}

On Godbolt, the hand-coded version compiles to a lot of SIMD instructions, while the zip-based implementation doesn't. What's going on here? If I implement it using iterators, it also gets vectorized. I thought ranges just use iterators under the hood. Are these expressions not equivalent?
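For reference, the iterator-based variant mentioned above is not shown in the question. A minimal sketch of what it might look like follows; the name `bla_iter` and the exact loop shape are assumptions, not the code that was actually tested.

```cpp
#include <vector>

// Hypothetical iterator-based variant (bla_iter is an assumed name).
// Like the indexed loop, it only checks a's end, so b must be at
// least as long as a.
int bla_iter(const std::vector<int>& a, const std::vector<int>& b) {
    int res = 0;
    auto ib = b.begin();
    for (auto ia = a.begin(); ia != a.end(); ++ia, ++ib) {
        res += *ia * *ib;
    }
    return res;
}
```

Note that this loop has a single termination condition, whereas zip has to stop at the end of whichever range is shorter.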

Peter Cordes
Stein
  • @PepijnKramer Only the first snippet, the second doesn't. – HolyBlackCat Aug 24 '23 at 16:10
  • @PepijnKramer Yeah - but that's the "hand-coded" version. No such instructions in the second link. – Adrian Mole Aug 24 '23 at 16:11
  • The hand-coded version fails to vectorize if you replace `i < a.size()` with `i < a.size() && i < b.size()`, but it works with `i < a.size() & i < b.size()`: https://godbolt.org/z/zecvdG8qf (I'm not sure exactly how `views::zip(...).end()` is defined) – chtz Aug 24 '23 at 17:04
  • Even with `-march=x86-64-v3` we don't get vectorization of the 2nd version. (Where vectorization would be more profitable, given 256-bit vectors and the SSE4.1 / AVX2 32x32 => 32-bit SIMD multiply instruction. `pmuludq` is a widening 32x32 => 64-bit multiply, so with no -march option GCC has to shuffle to use that twice per input vector and combine the results.) – Peter Cordes Aug 24 '23 at 17:06
  • clang with libc++ (instead of libstdc++ that it and GCC use by default) does vectorize: https://godbolt.org/z/TPxG35jvK . Also, GCC's `-fopt-info-vec-missed` reports that GCC couldn't vectorize the libstdc++ loop because "*number of iterations cannot be computed*", which sounds like what @chtz found. (GCC/Clang can only vectorize loops when the trip-count isn't data-dependent, e.g. not strlen or memchr, only loops where the trip-count can be computed before the first iteration. Maybe a branchy loop condition is enough to throw it off?) – Peter Cordes Aug 24 '23 at 17:09
  • @PepijnKramer: You could delete your erroneous first comment to remove distractions for future readers. – Peter Cordes Aug 24 '23 at 20:39
  • @PeterCordes no problem, done ;) – Pepijn Kramer Aug 25 '23 at 04:55
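
The loop conditions discussed in the comments can be put side by side. This is only a sketch: the function names are made up for illustration, the first two loops mirror chtz's `&&` and `&` variants, and the third is an assumed workaround that computes the trip count before the loop. Whether the third one actually vectorizes has not been verified here and should be checked on Godbolt with the compiler and flags of interest.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Short-circuit condition: reported in the comments not to vectorize with GCC.
int dot_logical_and(const std::vector<int>& a, const std::vector<int>& b) {
    int res = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        res += a[i] * b[i];
    }
    return res;
}

// Bitwise & condition: reported in the comments to vectorize.
int dot_bitwise_and(const std::vector<int>& a, const std::vector<int>& b) {
    int res = 0;
    for (std::size_t i = 0; (i < a.size()) & (i < b.size()); ++i) {
        res += a[i] * b[i];
    }
    return res;
}

// Assumed workaround: compute the trip count once, before the first iteration.
int dot_min_size(const std::vector<int>& a, const std::vector<int>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    int res = 0;
    for (std::size_t i = 0; i < n; ++i) {
        res += a[i] * b[i];
    }
    return res;
}
```

All three return the same value for any pair of inputs, since each stops at the shorter vector; they differ only in how the loop bound is presented to the optimizer.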
