
With your help, I have used SSE in my code (sample below) with a significant performance boost, and I was wondering whether this boost could be improved further by using the 256-bit registers of AVX.

    // SSE4.1: element-wise add of the two vectors, keeping a running element-wise max
    int result[4] __attribute__((aligned(16))) = {0};
    __m128i vresult = _mm_set1_epi32(0);
    __m128i v1, v2, vmax;
    for (int k = 0; k < limit; k += 4) {
        v1 = _mm_load_si128((__m128i *) &myVector[positionNodeId + k]);
        v2 = _mm_load_si128((__m128i *) &myVector2[k]);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }
    // horizontal max of the four lanes of the accumulator
    _mm_store_si128((__m128i *) result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);

So, I have three questions:

1.) How could the above rather simple SSE code be converted to AVX?
2.) What header should I include for that?
3.) And what flag should I pass to my gcc compiler (instead of -msse4.1) for AVX to work?

Thanks in advance for your help.

Alexandros
  • So have you tried to solve it yourself? [By the way, isn't there a horizontal max somewhere in SSE, to save the last line, which will be even worse in AVX] – Mats Petersson Sep 03 '13 at 09:02
  • Could you tell (for anyone interested among us) what performance improvement you get, and compared to what code? – SChepurin Sep 03 '13 at 09:02
  • @MatsPetersson, as far as I know there is no general horizontal max/min in SSE/AVX. The only instruction I know of is [_mm_minpos_epu16](http://msdn.microsoft.com/en-us/library/vstudio/bb514085%28v=vs.100%29.aspx). Negation can be used for the max. But that only works for 16-bit unsigned words. – Z boson Sep 03 '13 at 10:11
  • As you can see, the max command runs only once, outside the loop. So it is really not a big deal – Alexandros Sep 03 '13 at 10:14
  • In this code, limit = 64. Also, this code runs many thousands of times (30,000-1,000,000), and I get a solid 10-20% improvement. – Alexandros Sep 03 '13 at 10:15
  • That's right, if limit is large enough (which it should be to be doing this), the last horizontal max should be insignificant. – Z boson Sep 03 '13 at 10:22
  • If limit is only 64 then the horizontal max probably has a small effect. Even if you run the function many times, each time you have to do the horizontal max on 8/64 elements. But I don't think you have any other option. Doing the horizontal max with AVX will require several permutation/shuffle/max instructions anyway. – Z boson Sep 03 '13 at 10:26
  • @AlexandrosE, have you tried unrolling the loop in your SSE code? You might get a boost unrolling a few times (a sketch of this appears after these comments). – Z boson Sep 03 '13 at 13:12
  • @AlexandrosE, I updated my answer with a horizontal max function using SSE and AVX. – Z boson Sep 03 '13 at 18:35
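
For reference, a minimal sketch of the unrolling suggested above, assuming limit is a multiple of 8 and the same alignment as the question's code; two independent accumulators shorten the dependency chain on the max:

    // 2x unrolled SSE4.1 loop with two independent max accumulators
    __m128i vres0 = _mm_set1_epi32(0);
    __m128i vres1 = _mm_set1_epi32(0);
    for (int k = 0; k < limit; k += 8) {
        __m128i a0 = _mm_load_si128((__m128i *) &myVector[positionNodeId + k]);
        __m128i b0 = _mm_load_si128((__m128i *) &myVector2[k]);
        __m128i a1 = _mm_load_si128((__m128i *) &myVector[positionNodeId + k + 4]);
        __m128i b1 = _mm_load_si128((__m128i *) &myVector2[k + 4]);
        vres0 = _mm_max_epi32(vres0, _mm_add_epi32(a0, b0));
        vres1 = _mm_max_epi32(vres1, _mm_add_epi32(a1, b1));
    }
    __m128i vresult = _mm_max_epi32(vres0, vres1); // combine before the horizontal max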

1 Answer

1.) This code can be easily converted to AVX2 (see below)
2.) #include <x86intrin.h>
3.) compile with -mavx2
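
For example, a full compile line might look like this (hypothetical file name):

    gcc -O3 -mavx2 -o myprog myprog.c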

You will need a CPU that supports AVX2. Currently only Intel Haswell processors support this. I don't have a Haswell processor (yet), so I could not test the code.
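
If the binary might also run on CPUs without AVX2, one option is a runtime check; a minimal sketch using GCC's __builtin_cpu_supports (available since gcc 4.8):

    #include <stdio.h>

    int main(void) {
        // returns nonzero if the running CPU reports AVX2 support
        if (__builtin_cpu_supports("avx2"))
            printf("AVX2 available: use the 256-bit integer path\n");
        else
            printf("No AVX2: fall back to the SSE4.1 version\n");
        return 0;
    }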

    int result[8] __attribute__((aligned(32))) = {0}; // only needed by the scalar fallback below
    __m256i vresult = _mm256_set1_epi32(0);
    __m256i v1, v2, vmax;

    for (int k = 0; k < limit; k += 8) {
        v1 = _mm256_load_si256((__m256i *) &myVector[positionNodeId + k]);
        v2 = _mm256_load_si256((__m256i *) &myVector2[k]);
        vmax = _mm256_add_epi32(v1, v2);
        vresult = _mm256_max_epi32(vresult, vmax);
    }
    return horizontal_max_Vec8i(vresult);

    // scalar fallback for the horizontal max:
    //_mm256_store_si256((__m256i *) result, vresult);
    //int mymax = result[0];
    //for(int k=1; k<8; k++) {
    //    if(result[k]>mymax) mymax = result[k];
    //}
    //return mymax;

Edit: I suspect that, since you are only running over 64 elements, the horizontal max has a small but not insignificant computation time. I came up with a horizontal_max_Vec4i function for SSE and a horizontal_max_Vec8i function for AVX (it does not need AVX2). Try replacing max(max(max(result[0], result[1]), result[2]), result[3]) with horizontal_max_Vec4i.

int horizontal_max_Vec4i(__m128i x) {
    __m128i max1 = _mm_shuffle_epi32(x, _MM_SHUFFLE(0,0,3,2));    // bring elements 2,3 down to lanes 0,1
    __m128i max2 = _mm_max_epi32(x, max1);                        // max of pairs (0,2) and (1,3)
    __m128i max3 = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0,0,0,1)); // bring element 1 down to lane 0
    __m128i max4 = _mm_max_epi32(max2, max3);                     // overall max now in lane 0
    return _mm_cvtsi128_si32(max4);                               // extract the lowest 32-bit lane
}

int horizontal_max_Vec8i(__m256i x) {
    __m128i low = _mm256_castsi256_si128(x);       // lower 128 bits (no instruction generated)
    __m128i high = _mm256_extractf128_si256(x,1);  // upper 128 bits
    return horizontal_max_Vec4i(_mm_max_epi32(low, high));
}
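
A quick way to sanity-check horizontal_max_Vec4i (a minimal sketch; compile with -msse4.1):

    #include <stdio.h>
    #include <smmintrin.h>  // SSE4.1 intrinsics

    // horizontal_max_Vec4i as defined above goes here

    int main(void) {
        __m128i v = _mm_setr_epi32(3, 42, -7, 15);
        printf("%d\n", horizontal_max_Vec4i(v));  // prints 42
        return 0;
    }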
Z boson
  • I have access to an Ivy Bridge (i7-3770) and a Vishera (FX-8350). So, this code will not run with plain AVX, which is what those workstations support, and will only run on Haswell (4770...). Right? – Alexandros Sep 03 '13 at 10:18
  • That's right. AVX does not have most of the 256-bit integer instructions. You need AVX2 for that. – Z boson Sep 03 '13 at 10:21
  • On my FX-8350, _mm256_add_epi32 and _mm256_max_epi32 are not recognized. On the other hand, _mm256_store_si256 and _mm256_load_si256 seem to be OK. Can I replace _mm256_add_epi32 with two _mm_add_epi32-type calls? Is that possible, and how? – Alexandros Sep 03 '13 at 15:12
  • @AlexandrosE. You would have to get the high and low parts of the AVX register. Do `__m128i low = _mm256_castsi256_si128(ymm)` and `__m128i high = _mm256_extractf128_si256(ymm,1)` (a sketch of this appears after these comments). – Z boson Sep 03 '13 at 15:17
  • The horizontal max is outside the loop (so it is not over 64 elements but over 4). So, your horizontal SSE max version is slower. – Alexandros Sep 03 '13 at 20:37
  • @AlexandrosE, the horizontal max is only supposed to reduce the last SSE (AVX) register, which is 4 (8) elements wide. But I suspected it might be slower than plain scalar code since it is several SSE instructions anyway. Thanks for testing this. – Z boson Sep 04 '13 at 07:26
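
For reference, a minimal sketch of the half-register approach described in the comment above, with an assumed helper name add_epi32_avx1; the cast/extract/insert intrinsics are plain AVX and the adds are SSE2, so this runs on AVX-only CPUs like the FX-8350:

    #include <immintrin.h>

    // Emulate _mm256_add_epi32 without AVX2: split into 128-bit halves,
    // add each half with SSE2, and reassemble the 256-bit result.
    static inline __m256i add_epi32_avx1(__m256i a, __m256i b) {
        __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                                   _mm256_castsi256_si128(b));
        __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                                   _mm256_extractf128_si256(b, 1));
        return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    }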