
The following loop is executed hundreds of times.
elma and elmc are both unsigned long (64-bit) arrays, as are res1 and res2.

unsigned long simdstore[2];  
__m128i *p, simda, simdb, simdc;  
p = (__m128i *) simdstore;  

for (i = 0; i < _polylen; i++)  
{  

    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k++)  
    {     

    1.  //res1[i + k] ^= _mulpre1[u1][k];
    2.  //res2[i + k] ^= _mulpre2[u2][k];
    3.  _mm_prefetch ((const void *) &_mulpre2[u2][k], _MM_HINT_T0);
    4.  _mm_prefetch ((const void *) &_mulpre1[u1][k], _MM_HINT_T0);
    5.  simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);
    6.  _mm_prefetch ((const void *) &res2[i + k], _MM_HINT_T0);
    7.  _mm_prefetch ((const void *) &res1[i + k], _MM_HINT_T0);
    8.  simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);
    9.  simdc = _mm_xor_si128 (simda, simdb);
    10. _mm_store_si128 (p, simdc);
    11. res1[i + k] = simdstore[0];
    12. res2[i + k] = simdstore[1];
    }     
}  

Within the for loop, the scalar version of the code (commented out) runs twice as fast as the SIMD code. The cachegrind output (instruction reads) for the numbered lines above is shown below.

Line 1: 668,460,000 2 2
Line 2: 668,460,000 1 1
Line 3: 89,985,000 1 1
Line 4: 89,985,000 1 1
Line 5: 617,040,000 2 2
Line 6: 44,992,500 0 0
Line 7: 44,992,500 0 0
Line 8: 539,910,000 1 1
Line 9: 128,550,000 0 0
Line 10: . . .
Line 11: 205,680,000 0 0
Line 12: 205,680,000 0 0

From the figures above, it appears that the commented-out scalar code requires significantly fewer instructions than the SIMD code.

How can this code be made faster?

anup
  • This is a duplicate of your own question. Go back there and understand why this answer (and not the answer you marked as correct) solves the problem: http://stackoverflow.com/questions/4394930/simd-code-runs-slower-than-scalar-code/4395337#4395337 – j_random_hacker Dec 09 '10 at 12:27
  • I asked a new question because I had already accepted an answer on the earlier one. But the fact remains that the instruction count is higher for the SIMD code. Any thoughts there, please? – anup Dec 09 '10 at 12:44
  • Which answer are you talking about when you say "the instruction count is more in case of SIMD code"? If you mean renick's answer, you're wrong -- check the comments on it. Also I don't understand why you accepted the answer that you accepted, if it didn't answer your question. – j_random_hacker Dec 09 '10 at 12:53
  • renick's answer is correct, I have seen the comments on it. I accepted that answer because no one else was commenting on it, and I wanted to close the matter. – anup Dec 09 '10 at 13:02
  • So did you try renick's code? – j_random_hacker Dec 09 '10 at 13:28
  • If renick's code isn't faster than both your SIMD version and the non-SIMD version, I'd be very surprised. – j_random_hacker Dec 09 '10 at 13:29
  • Trying with renick's code, but getting some error... when I get rid of this and get some result, I will let you know. I think the error may occur when the index of res2 is odd or something. – anup Dec 09 '10 at 15:45
  • @j_random_hacker Done with renick's code, but it still takes 1.5 times as long as the non-SIMD code. – anup Dec 09 '10 at 17:33
  • Ah, I had forgotten about that outer loop producing odd values of `i`... So in fact his version as it stands will crash due to alignment problems. Tricky because if `i` is odd, `i+k` (the index into `res1`) will be odd, but `k` (the index into `_mulpre1[u1]`) will be even. Not sure what you did to get around it, but I'd suggest doubling the outer loop so that you have `for (i = 0; i < _polylen; i += 2)` followed by `for (i = 1; i < _polylen; i += 2)`, and actually making separate copies of `_mulpre1` and `_mulpre2` that are offset by 1 position. Hope that makes sense. – j_random_hacker Dec 09 '10 at 17:52
  • Thanks, I used _mm_loadu_si128; it works with unaligned data, in contrast to _mm_load_si128, where the above-mentioned problem occurs. – anup Dec 09 '10 at 17:55
  • The idea being that inside each of these 2 outer loops you would use renick's code as-is, except that in the 2nd outer loop you would use `_mulpre1offset` and `_mulpre2offset` instead. This means doubling the memory required for the `_mulpre` info, but this way the inside of each outer loop can still calculate 2 elements at a time without alignment errors. – j_random_hacker Dec 09 '10 at 17:56
  • I see, those unaligned loads are slow... My guess is it will actually be faster to use 2 copies of the `_mulpre` info + aligned loads as I've suggested, although this could wind up slower if you run out of cache. – j_random_hacker Dec 09 '10 at 17:57
  • Thanks a lot. Though I didn't understand the _mulpre offset concept, replacing the slow _mm_loadu with _mm_load helped. I did that with the two i loops you mentioned above, with odd i's involving some shifts. – anup Dec 09 '10 at 19:21
  • Unfortunately I don't have the time to try coding up something for you. Yes, using shifts would be another way, although I suspect still quite slow. What I mean is: After the 1st outer loop finishes calculating results for all even values of `i`, calculate `res1[1]` and `res2[1]` using non-SIMD code, then have the 2nd outer loop calculate remaining aligned pairs ((2,3), (4,5), etc.) using SIMD code. This code would normally require unaligned accesses into `_mulpre1` etc, so that's why you create 8-byte shifted copies in `_mulpre1offset` beforehand and use them instead. – j_random_hacker Dec 09 '10 at 19:51
  • What do you think about keeping separate copies of res1 and res2 for even and odd i's and XORing them together at the end? – anup Dec 10 '10 at 04:33

2 Answers


Take out the _mm_prefetch intrinsics - they are achieving nothing in this context and may even be hurting performance. Prefetch is only of benefit if (a) you have bandwidth to spare and (b) you can issue the prefetch hint several hundred clock cycles ahead of when the data is actually needed. I think neither (a) nor (b) is true in your case.
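
For reference, a minimal sketch of the inner loop with the prefetch hints stripped out and everything else left exactly as in the question:

    for (k = 0; k < 20; k++)
    {
        /* same work as before, minus the _mm_prefetch calls */
        simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);
        simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);
        simdc = _mm_xor_si128 (simda, simdb);
        _mm_store_si128 (p, simdc);
        res1[i + k] = simdstore[0];
        res2[i + k] = simdstore[1];
    }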

Paul R

Your performance problem is this:

_mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);

The _mm_set(a,b,c,d) class of intrinsics is very slow. Only the single-parameter set intrinsics (a.k.a. broadcast) are fast.

I looked at what they do in assembly code.

They basically create an array on the stack, move your two integers from the multidimensional arrays they currently reside in into that stack array using normal memory moves (mov DWORD), and then load from the stack array using an XMM memory move (mov XMMWORD).

The scalar version goes directly from memory to registers. FASTER!

The overhead comes from the fact that an XMM register can only be loaded 128 bits at a time, so your program is first assembling the 128 bits in another area of memory before loading them.

If there's a way to move 64-bit values directly between a general-purpose register and an XMM register, I'm still looking for it.
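
One hedged alternative worth trying (assumes SSE2 and <emmintrin.h>): load each 64-bit table entry straight from memory with _mm_loadl_epi64 (a movq load) and combine the two halves with _mm_unpacklo_epi64, so no explicit stack array is involved. Whether this actually beats _mm_set_epi64x depends on the compiler and would need measuring:

    /* pairs the values the same way as
       _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]) */
    __m128i lo = _mm_loadl_epi64 ((const __m128i *) &_mulpre1[u1][k]);
    __m128i hi = _mm_loadl_epi64 ((const __m128i *) &_mulpre2[u2][k]);
    simda = _mm_unpacklo_epi64 (lo, hi);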

To get a speed boost from using SSE/XMM registers, your data will probably need to already be laid out contiguously in memory. Loading out-of-order data into an XMM register is only worth it if you can do several XMM operations per out-of-order load; here you're doing a single XOR operation.
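
Along those lines, a hedged sketch of the direct-load approach discussed in the comments on the question (essentially renick's suggestion): process two adjacent k values per iteration so the 128-bit data can be loaded straight from res1/res2 and the lookup tables, with no _mm_set_epi64x and no bounce through simdstore. It uses unaligned loads/stores because i + k can be odd, and assumes the inner trip count (20) stays even:

    /* requires <emmintrin.h> */
    for (k = 0; k < 20; k += 2)
    {
        __m128i m1 = _mm_loadu_si128 ((const __m128i *) &_mulpre1[u1][k]);
        __m128i m2 = _mm_loadu_si128 ((const __m128i *) &_mulpre2[u2][k]);
        __m128i r1 = _mm_loadu_si128 ((const __m128i *) &res1[i + k]);
        __m128i r2 = _mm_loadu_si128 ((const __m128i *) &res2[i + k]);
        /* res1[i+k] ^= table[k]; res1[i+k+1] ^= table[k+1]; likewise for res2 */
        _mm_storeu_si128 ((__m128i *) &res1[i + k], _mm_xor_si128 (r1, m1));
        _mm_storeu_si128 ((__m128i *) &res2[i + k], _mm_xor_si128 (r2, m2));
    }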

sean262