
I am looking for the most efficient way, with AVX2, to multiply two aligned `int16_t` arrays whose length is divisible by 16 and sum the products.

After multiplying into a vector `x`, I started with `_mm256_extracti128_si256` and `_mm256_castsi256_si128` to get the high and low halves of `x`, and added them with `_mm_add_epi16`.

I copied the result register, applied `_mm_move_epi64` to the original register, and added both again with `_mm_add_epi16`. Now I think I have

-, -, -, -, x15+x7+x11+x3, x14+x6+x10+x2, x13+x5+x9+x1, x12+x4+x8+x0

within the 128-bit register. But now I am stuck: I don't know how to efficiently sum up the remaining four entries, or how to extract the 16-bit result.
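
For context, a minimal sketch of one way to finish from that layout (partial sums in the low four 16-bit lanes), using the standard shuffle-and-add technique from the hsum Q&A linked in the comments; the helper name is mine, not code from the question:

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: finish a 16-bit hsum when the four partial sums
// sit in the low four 16-bit lanes of v (high lanes are don't-care).
static inline int16_t hsum4_epi16(__m128i v) {
    // Swap the two 32-bit halves of the low 64 bits: lanes become [s2,s3,s0,s1]
    __m128i sh = _mm_shufflelo_epi16(v, _MM_SHUFFLE(1, 0, 3, 2));
    v = _mm_add_epi16(v, sh);                             // lane0 = s0+s2, lane1 = s1+s3
    sh = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 0, 0, 1)); // bring lane1 down to lane0
    v = _mm_add_epi16(v, sh);                             // lane0 = s0+s1+s2+s3
    return (int16_t)_mm_cvtsi128_si32(v);                 // low 16 bits of the low lane
}
```

As the comments and the answer below show, though, widening to 32-bit early via `_mm256_madd_epi16` avoids this 16-bit shuffle dance entirely.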

  • `_mm_move_epi64` is not a useful part of a horizontal sum. Also, you didn't describe how you got from 16-bit int elements to float for `_mm_add_ps`. It makes little sense to `_mm_add_epi16` for one step of the horizontal sum but `_ps` for another. But anyway, use `pmaddwd` for the integer multiply step to accumulate horizontally into 32-bit elements, simplifying the hsum into 32-bit elements and giving more range without overflow. Anyway, if overflow isn't a problem, you want to do the hsum after the vertical `pmaddwd` / `paddd` loop. – Peter Cordes May 27 '20 at 11:08
  • [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) – Peter Cordes May 27 '20 at 11:09
  • Thanks Peter, I know this thread, but I couldn't find a solution for my specific data type there (if the answer is already there, then I don't understand it). – user3572032 May 27 '20 at 11:13
  • Yes Peter, I made some copy & paste errors, so I do not employ any conversion to float. It's all in epi16/integer. – user3572032 May 27 '20 at 11:20
  • Ok, well if that would be safe (not overflowing), then even better to initially multiply with [`_mm256_madd_epi16`](https://www.felixcloutier.com/x86/pmaddwd) and `_mm_add_epi32` in the loop, so you can just use a 32-bit hsum outside. The best way to implement 16-bit hsums uses `_mm_madd_epi16` as the first step anyway (as mentioned in that hsum link), but actually using its full power as your multiply instead of with dummy `1` multipliers is even better (see the sketch after these comments). – Peter Cordes May 27 '20 at 11:25
  • Overflows cannot occur. I do not understand the last part of your last sentence. Thanks again. – user3572032 May 27 '20 at 11:27
  • The inner loop of [Optimizing Numeric Program with SIMD](https://stackoverflow.com/q/62055733) is *very* similar to what you should do, but replace `_mm256_mullo_epi32` with `_mm256_madd_epi16`. (After the inner loop, it uses a silly naive way to hsum; you can replace that.) – Peter Cordes May 28 '20 at 06:55
  • I posted my temporary solution below. – user3572032 Jun 03 '20 at 15:36
  • Yes, that's the inner loop I was talking about, you could post it as an answer. – Peter Cordes Jun 03 '20 at 23:15
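
To illustrate the dummy-`1`-multipliers point from the comment above: `_mm_madd_epi16` multiplies 16-bit pairs and sums adjacent products into 32-bit lanes, so multiplying by all-ones is exactly the widening first step of a 16-bit hsum, and feeding it the real multiplicands gets that step for free. A minimal sketch (my own illustration, not code from the thread):

```cpp
#include <immintrin.h>

// As a pure hsum step: widen and pair-sum eight 16-bit lanes into four
// 32-bit lanes by "multiplying" with 1s.
__m128i widen_pairs(__m128i v16) {
    return _mm_madd_epi16(v16, _mm_set1_epi16(1));
}

// Using madd's full power: the multiply itself produces pair-summed
// 32-bit products, so the reduction can stay in 32-bit from the start.
__m128i mul_and_pair_sum(__m128i x16, __m128i y16) {
    return _mm_madd_epi16(x16, y16);
}
```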

1 Answer


Following the comments and hours of Google, here is my working solution:

```cpp
// AVX multiply (snippet from a larger benchmark: hash, start1/finish1/elapsed1,
// ZTYPE (int16_t here), MAX_SIEVING_DIM and al_entr1/al_entr2 are declared
// in the surrounding code)
hash = 1;
start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 2000000; i++) {
    ZTYPE* xv = al_entr1.c.data();
    ZTYPE* yv = al_entr2.c.data();

    __m256i tres = _mm256_setzero_si256();
    for (int ii = 0; ii < MAX_SIEVING_DIM; ii += 16)   // 16 int16_t per __m256i
    {
        // editor's note: alignment required; use _mm256_loadu_si256 for unaligned data
        __m256i xr = _mm256_load_si256((__m256i*)(xv + ii));
        __m256i yr = _mm256_load_si256((__m256i*)(yv + ii));
        // multiply 16-bit pairs and sum adjacent products into 32-bit lanes
        const __m256i tmp = _mm256_madd_epi16(xr, yr);
        tres = _mm256_add_epi32(tmp, tres);            // vertical 32-bit accumulate
    }

    // Reduction: horizontal sum of the eight 32-bit lanes of tres
    const __m128i x128 = _mm_add_epi32(_mm256_extracti128_si256(tres, 1),
                                       _mm256_castsi256_si128(tres));
    const __m128i x128_up = _mm_shuffle_epi32(x128, _MM_SHUFFLE(1, 0, 3, 2)); // = 78
    const __m128i x64  = _mm_add_epi32(x128, x128_up); // two lanes left
    const __m128i _x32 = _mm_hadd_epi32(x64, x64);     // every lane holds the sum

    const int res = _mm_extract_epi32(_x32, 0);        // _mm_cvtsi128_si32 also works
    hash |= res;
}

finish1 = std::chrono::high_resolution_clock::now();
elapsed1 = finish1 - start1;
std::cout << "AVX multiply: " << elapsed1.count() << " sec. (" << hash << ")" << std::endl;
```
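
For reference, the "Reduction" block is equivalent to the standard 32-bit hsum idiom from the Q&A linked in the comments; a self-contained sketch, where the helper name `hsum_8x32` is mine and not from the post:

```cpp
#include <immintrin.h>

// Standard 32-bit horizontal sum of a __m256i (behaviorally equivalent
// to the Reduction block above).
static inline int hsum_8x32(__m256i v) {
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));          // 8 -> 4 lanes
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2))); // 4 -> 2 lanes
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1))); // 2 -> 1 lane
    return _mm_cvtsi128_si32(s);                                         // low lane
}
```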

It is at least the fastest solution so far:

  • std::inner_product: 0.819781 sec. (-14335)
  • std::inner_product (aligned): 0.964058 sec. (-14335)
  • naive multiply: 0.588623 sec. (-14335)
  • Unroll multiply: 0.505639 sec. (-14335)
  • AVX multiply: 0.0488352 sec. (-14335)