
I am trying to exploit the 512-bit SIMD offered by KNC (Xeon Phi) to improve the performance of the C code below using Intel intrinsics. However, my intrinsic-based code runs slower than the auto-vectorized code.

C Code

int64_t match=0;
int *myArray __attribute__((aligned(64)));
myArray = (int*) malloc (sizeof(int)*SIZE); //SIZE is array size taken from user
randomize(myArray); //to fill some random data
int searchVal=24;
#pragma vector always
for(int i=0;i<SIZE;i++) {
   if (myArray[i]==searchVal) match++;
}
return match;

Intrinsic embedded code: In the code below I first load the array and compare it with the search key. The compare intrinsic returns a 16-bit mask value, which is reduced using _mm512_mask_reduce_add_epi32().

register int64_t match=0;
int *myArray __attribute__((aligned(64)));
myArray = (int*) malloc (sizeof(int)*SIZE); //SIZE is array size taken from user
const int values[16] __attribute__((aligned(64))) =
                {   1,1,1,1,
                    1,1,1,1,
                    1,1,1,1,
                    1,1,1,1,
                };
__m512i const flag = _mm512_load_epi32((void*) values);
__mmask16 countMask;

__m512i searchVal = _mm512_set1_epi32(24);
__m512i kV = _mm512_setzero_epi32();

for (int i=0;i<SIZE;i+=16)
{
   // kV = _mm512_setzero_epi32();
    kV = _mm512_loadunpacklo_epi32(kV, (void*)(&myArray[i]));
    kV = _mm512_loadunpackhi_epi32(kV, (void*)(&myArray[i + 16]));

    countMask = _mm512_cmpeq_epi32_mask(kV, searchVal);
    match += _mm512_mask_reduce_add_epi32(countMask, flag);
}
return match;

I believe I have somehow introduced extra cycles into this code, which is why it runs slower than the auto-vectorized code. Unlike SIMD128, which returns the result of a compare directly in a 128-bit register, SIMD512 returns the result in a mask register, which adds complexity to my code. Am I missing something here? There must be a way to compare and count successful matches directly, rather than going through masks.

Finally, please suggest ways to increase the performance of this code using intrinsics. I believe I can squeeze out more performance with intrinsics; that was at least true for SIMD128, where using intrinsics gained me 25% performance.

Boppity Bop

1 Answer


I suggest the following optimizations:

  • Use prefetching. Your code performs very little computation and is almost surely bandwidth-bound. Xeon Phi has hardware prefetching only into the L2 cache, so for optimal performance you need to insert prefetch instructions manually.
  • Use the aligned read _mm512_load_epi32, as hinted by @PaulR. Use the memalign function instead of malloc to guarantee that the array really is aligned on 64 bytes. And in case you ever need misaligned loads, use _mm512_undefined_epi32() as the source for the first misaligned load: it breaks the dependency on kV (in your current code) and lets the compiler do additional optimizations.
  • Unroll the loop by 2, or use at least two threads, to hide instruction latency.
  • Avoid using an int variable as the index; unsigned int, size_t, or ssize_t are better options.
Marat Dukhan
  • I was getting a segmentation fault when using malloc with _mm512_load_epi32; however, it works now with memalign. Any suggestions on reducing the computation cycles for the compare and reduce in my code? – user2749262 Feb 17 '14 at 02:51
  • You may try to replace `_mm512_mask_reduce_add_epi32(countMask,flag)` with `_mm_countbits(_mm512_mask2int(countMask))`, but I doubt that it will have any effect. – Marat Dukhan Feb 17 '14 at 05:05
  • Thanks for the suggestions. _mm_countbits() is slower than the _mm512_mask_reduce_add_epi32() method. After taking into account all the other suggestions in your previous comment, my code now runs 50% faster than the auto-vectorized code. I compiled the code as follows: icc -mmic -O -std=c99 – user2749262 Feb 17 '14 at 09:17
  • "Xeon Phi doesn't have hardware prefetching". Wrong. It has a hardware prefetcher to the L2. (See, for instance, slide 7 at http://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic.pdf ) – Jim Cownie Feb 20 '14 at 13:47
  • @JimCownie, you're right, but the point being is that the kernel in question needs software prefetching for better performance. – Marat Dukhan Feb 21 '14 at 02:54
  • @MaratDukhan Can you explain me with a small example code on how to use _mm512_undefined_epi32() to load misaligned array? – user2749262 Mar 10 '14 at 08:14
  • @user2749262 `kV = _mm512_loadunpacklo_epi32(_mm512_undefined_epi32(),(void* )(&myArray[i]));` – Marat Dukhan Mar 10 '14 at 20:46
  • @MaratDukhan So, I should write the code something like below `for (int i=0;i – user2749262 Mar 11 '14 at 03:41