
elma and elmc are both unsigned long arrays. So are res1 and res2.

unsigned long simdstore[2];  
__m128i *p, simda, simdb, simdc;  
p = (__m128i *) simdstore;  

for (i = 0; i < _polylen; i++)  
{
    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k++)  
    {
        //res1[i + k] ^= _mulpre1[u1][k];  
        //res2[i + k] ^= _mulpre2[u2][k];               

        simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);  
        simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);  
        simdc = _mm_xor_si128 (simda, simdb);  
        _mm_store_si128 (p, simdc);  
        res1[i + k] = simdstore[0];  
        res2[i + k] = simdstore[1];                     
    }     
}  

The inner loop contains both the non-SIMD and the SIMD version of the element-wise XOR. The first two (commented-out) lines inside the second for loop show the explicit scalar XOR; the rest implement the SIMD version of the same operation.

This loop is called hundreds of times from elsewhere in the program, so optimizing it would bring down the total computation time.

The problem is that the SIMD code runs many times slower than the scalar code.

EDIT: I have done partial unrolling:

unsigned long simdstore1[2], simdstore2[2], simdstore3[2], simdstore4[2];  /* four independent 16-byte temporaries */  
__m128i simda1, simda2, simda3, simda4;  
__m128i simdb1, simdb2, simdb3, simdb4;  
__m128i simdc1, simdc2, simdc3, simdc4;  
__m128i *p1, *p2, *p3, *p4;  
p1 = (__m128i *) simdstore1;  
p2 = (__m128i *) simdstore2;  
p3 = (__m128i *) simdstore3;  
p4 = (__m128i *) simdstore4;  

for (i = 0; i < 20; i++)  
{
    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k = k + 4)  
    {
        simda1  = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);  
        simda2  = _mm_set_epi64x (_mulpre2[u2][k + 1], _mulpre1[u1][k + 1]);  
        simda3  = _mm_set_epi64x (_mulpre2[u2][k + 2], _mulpre1[u1][k + 2]);  
        simda4  = _mm_set_epi64x (_mulpre2[u2][k + 3], _mulpre1[u1][k + 3]);  

        simdb1  = _mm_set_epi64x (res2[i + k], res1[i + k]);  
        simdb2  = _mm_set_epi64x (res2[i + k + 1], res1[i + k + 1]);  
        simdb3  = _mm_set_epi64x (res2[i + k + 2], res1[i + k + 2]);  
        simdb4  = _mm_set_epi64x (res2[i + k + 3], res1[i + k + 3]);  

        simdc1  = _mm_xor_si128 (simda1, simdb1);  
        simdc2  = _mm_xor_si128 (simda2, simdb2);  
        simdc3  = _mm_xor_si128 (simda3, simdb3);  
        simdc4  = _mm_xor_si128 (simda4, simdb4);  

        _mm_store_si128 (p1, simdc1);  
        _mm_store_si128 (p2, simdc2);  
        _mm_store_si128 (p3, simdc3);  
        _mm_store_si128 (p4, simdc4);  

        res1[i + k]= simdstore1[0];  
        res2[i + k]= simdstore1[1]; 
        res1[i + k + 1]= simdstore2[0];  
        res2[i + k + 1]= simdstore2[1];   
        res1[i + k + 2]= simdstore3[0];  
        res2[i + k + 2]= simdstore3[1]; 
        res1[i + k + 3]= simdstore4[0];  
        res2[i + k + 3]= simdstore4[1];   
    }  
}  

But the result does not change much; it still takes twice as long as the scalar code.

anup
  • Align res1/res2, swizzle the registers, and write straight to them. Get rid of the simdstore. – EboMike Dec 09 '10 at 05:35
  • res1/res2 are unsigned long arrays that I have used in my code. I am not sure what you mean by aligning them to 16 bytes and writing to them directly. – anup Dec 09 '10 at 05:47
  • I mean make sure that their starting address is aligned by 16 bytes, and then write to them directly using _mm_store_si128. (Obviously, you'll have to take two results and combine them into one by swizzling the vector register.) Writing to memory and then reading from it again will cause a stall. – EboMike Dec 09 '10 at 05:50

4 Answers


Disclaimer: I come from a PowerPC background, so what I'm saying here might be complete hogwash. But you're stalling your vector pipeline since you try to access your results right away.

It is best to keep everything in your vector pipeline. As soon as you do any kind of conversion from vector to int or float, or store the result to memory, you stall.

The best mode of operation when dealing with SSE or VMX is: Load, process, store. Load the data into your vector registers, do all the vector processing, then store it to memory.

I would recommend: reserve several __m128i registers, unroll your loop several times, then store the results.

EDIT: Also, if you unroll, and if you align res1 and res2 to 16 bytes, you can store your results directly to memory without going through this simdstore indirection, which probably causes a load-hit-store (LHS) and another stall.

EDIT: Forgot the obvious. If your polylen is typically large, don't forget to do a data cache prefetch on every iteration.
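As a rough sketch of what that could look like for the code in the question (not part of the original answer; the variable names follow the question, and unaligned stores are used only because &res1[i + k] is not guaranteed to be 16-byte aligned for every i), the inner k loop could become:

    for (k = 0; k < 20; k += 2)
    {
        /* Build the mixed {res1, res2} pairs exactly as in the question. */
        __m128i a0 = _mm_set_epi64x (_mulpre2[u2][k],     _mulpre1[u1][k]);
        __m128i a1 = _mm_set_epi64x (_mulpre2[u2][k + 1], _mulpre1[u1][k + 1]);
        __m128i b0 = _mm_set_epi64x (res2[i + k],         res1[i + k]);
        __m128i b1 = _mm_set_epi64x (res2[i + k + 1],     res1[i + k + 1]);

        __m128i c0 = _mm_xor_si128 (a0, b0);   /* {res1[i+k],   res2[i+k]}   */
        __m128i c1 = _mm_xor_si128 (a1, b1);   /* {res1[i+k+1], res2[i+k+1]} */

        /* Swizzle so each register holds two consecutive elements of the SAME
           array, then write straight back - no round trip through simdstore. */
        __m128i r1 = _mm_unpacklo_epi64 (c0, c1);
        __m128i r2 = _mm_unpackhi_epi64 (c0, c1);
        _mm_storeu_si128 ((__m128i *) &res1[i + k], r1);
        _mm_storeu_si128 ((__m128i *) &res2[i + k], r2);
    }

If res1 and res2 are kept 16-byte aligned (and i + k kept even), the unaligned stores can become _mm_store_si128.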

EboMike
  • Loop unrolling is good on PowerPC, but not so good on modern x86 CPUs, where there are fewer registers to play with and the CPU itself will do some unrolling for small loops. – Paul R Dec 09 '10 at 10:04
  • Good to know, thanks! In this particular case, I guess it makes sense to unroll at least once in order to construct the output in the vector registers (since you need to take two consecutive results and swizzle them together to one) - I guess one of the biggest problems here is the simdstore -> res1/res2 indirection. – EboMike Dec 09 '10 at 10:07
  • @Paul R In my experience unrolling loops in x86 is actually useful. It’s been a while since I’ve profiled it on 32-bit, but certainly on x86-64, it is useful (there are a lot more registers in 64-bit); I consistently get speedups doing so. Depending on the case it gives the optimizer a chance to really reorder things in a more optimal way, not to mention cutting down on counter variables or other operations that are done every loop that can be factored out by unrolling. – Apriori Feb 07 '14 at 22:30
  • @Rich: loop unrolling can certainly help in *some* cases, but in others it can be counter-productive, particularly if you unroll beyond the size of the Loop Stream Detector's uop cache. You also need to decide whether you're going to let the compiler unroll loops for you, or whether you're going to do it manually. Bottom line: unrolling is not the panacea that it once was, e.g. on older x86 CPUs or other architectures such as PowerPC. – Paul R Feb 07 '14 at 23:09

You're doing very little computation here relative to the number of loads and stores being performed, so you are seeing little benefit from SIMD. You will probably find it more useful in this case to use scalar code, particularly if you have an x86-64 CPU that you can use in 64-bit mode. This will reduce the number of loads and stores, which are currently the dominant factor in your performance.

(Note: you probably should NOT unroll the loop, especially if you're using Core 2 or newer.)
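For reference, the scalar 64-bit inner loop this refers to is essentially the commented-out lines from the question; per element it does one load from the lookup table, one load from the result array, and one store, with no packing into or out of XMM registers:

    for (k = 0; k < 20; k++)
    {
        res1[i + k] ^= _mulpre1[u1][k];
        res2[i + k] ^= _mulpre2[u2][k];
    }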

Paul R

The way your code looks, res1 and res2 seem to be totally independent vectors, yet you mix them in the same register to XOR them.

I would use different registers, something like the following (the vectors must all be 16-byte aligned):

__m128i x0, x1, x2, x3;  
for (i = 0; i < _polylen; i++)  
{  

    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k+=2)  
    {     
        //res1[i + k] ^= _mulpre1[u1][k];
        x0 = _mm_load_si128((const __m128i *) &_mulpre1[u1][k]);
        x1 = _mm_load_si128((const __m128i *) &res1[i + k]);
        x0 = _mm_xor_si128 (x0, x1);
        _mm_store_si128 ((__m128i *) &res1[i + k], x0);
        //res2[i + k] ^= _mulpre2[u2][k];               
        x2 = _mm_load_si128((const __m128i *) &_mulpre2[u2][k]);
        x3 = _mm_load_si128((const __m128i *) &res2[i + k]);
        x2 = _mm_xor_si128 (x2, x3);
        _mm_store_si128 ((__m128i *) &res2[i + k], x2);
   }     
}  

Note that I am using only 4 registers. You can manually unroll to use all 8 XMM registers on x86, or more on x86_64.

renick
  • Be sure to align res1 and res2 properly. Btw, I thought res1 gets the first 8 bytes of the register, and res2 gets the other 8 bytes? So you'd need to do two iterations and swizzle the results. – EboMike Dec 09 '10 at 06:02
  • @EboMike is correct: res1 gets the first 8 bytes whereas res2 gets the rest. @renick, the way you have implemented it does not get any extra advantage from SIMD, because you have used two separate XORs, and it is just equivalent to scalar code. – anup Dec 09 '10 at 06:11
  • @renick mulpre and res are both 64 bits wide, so pushing them into SIMD registers and XORing them may not be so useful... – anup Dec 09 '10 at 06:24
  • @anup If it is plain wrong then I should delete it. Otherwise note that there are 4 XOR ops per loop iteration with 8 instructions, so it shouldn't be too shabby. – renick Dec 09 '10 at 07:09
  • @renick: It's not wrong! @anup: He's computing *2 elements at a time* for *each* array, separately. It *is* taking advantage of SIMD: the 2 XOR calls per loop iteration are performing 4 operations in total. And it's gonna be faster than your approach of mixing the `res1[]` and `res2[]` arrays because it avoids the intermediate store to `simdstore`. – j_random_hacker Dec 09 '10 at 08:28
  • @anup, @EboMike: In each loop iteration, he's computing 2 elements for `res1[]`, then 2 elements for `res2[]`. This is fine because there are no dependencies between adjacent elements of the same array. It's fast, solid code. – j_random_hacker Dec 09 '10 at 08:30
  • @anup: Notice that this version uses `k+=2` instead of `k++` in the loop - it does half as many iterations. – caf Dec 09 '10 at 12:02
  • Btw doing your store using an _mm_stream_ps usually ends up much faster as you aren't bringing a new cache line in when you write. You only cache the read data. Writes really don't need the caching and you can probably get a significant speed up ... – Goz Dec 10 '10 at 19:27

I'm no SIMD expert either, but it looks like you could also benefit from prefetching the data, coupled with the unrolling EboMike mentioned. It might also help if you merged res1 and res2 into one aligned array (of structs, depending on what else uses them); then you don't need the extra copying and can operate on it directly.
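A hedged sketch of that merged layout (the respair struct and xor_pair helper are illustrative names, not from the original code; it assumes a 64-bit build where each pair is exactly 16 bytes and the array is allocated 16-byte aligned):

    #include <emmintrin.h>

    /* res1[i] and res2[i] interleaved so one 16-byte slot holds both results. */
    typedef struct { unsigned long r1, r2; } respair;

    /* XOR one pair of precomputed values into one result pair, in place. */
    static void xor_pair (respair *res, unsigned long m1, unsigned long m2)
    {
        __m128i a = _mm_set_epi64x ((long long) m2, (long long) m1); /* {m1, m2} */
        __m128i b = _mm_load_si128 ((const __m128i *) res);          /* {r1, r2} */
        _mm_store_si128 ((__m128i *) res, _mm_xor_si128 (a, b));
    }

Inside the original loops this would be called as xor_pair (&res[i + k], _mulpre1[u1][k], _mulpre2[u2][k]), where res is the merged array.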

Necrolis