
I am new to multithreaded programming. I recently started a project in which I apply cilk_for. Here is the code:

void myfunction(short *myarray)
{
    __m128i *array = (__m128i *) myarray;
    cilk_for (int i = 0; i < N_LOOP1; i++)
    {
        for (int z = 0; z < N_LOOP2; z += 8)
        {
            array[z]   = _mm_and_si128(array[z],   mym128i);
            array[z+1] = _mm_and_si128(array[z+1], mym128i);
            array[z+2] = _mm_and_si128(array[z+2], mym128i);
            array[z+3] = _mm_and_si128(array[z+3], mym128i);
            array[z+4] = _mm_and_si128(array[z+4], mym128i);
            array[z+5] = _mm_and_si128(array[z+5], mym128i);
            array[z+6] = _mm_and_si128(array[z+6], mym128i);
            array[z+7] = _mm_and_si128(array[z+7], mym128i);
            array += 8;
        }
    }
}

After the above code runs, something ridiculous happens: the data in the array isn't updated correctly. For example, if I have an array with 1000 elements, there is a chance that the array will be updated correctly (all 1000 elements AND-ed), but there is also a chance that some parts of the array will be skipped (elements 1 to 300 are AND-ed, elements 301 to 505 aren't, elements 506 to 707 are, etc.). The skipped parts are different on each run, so I think the problem is a cache miss. Am I right? Any help is appreciated. :)

Piel

1 Answer


The issue is that the array pointer is shared between the threads Cilk spawns, and your array variable is incremented on every loop iteration. That only works with serial execution. In your snippet, multiple threads end up processing the same elements of the array while other parts are never touched at all.

To solve this, I would compute the base address inside the outer loop from the loop index, so that every thread spawned by Cilk can calculate its address independently. Maybe something like:

void myfunction(short *myarray)
{
    cilk_for (int i = 0; i < N_LOOP1; i++)
    {
        /* Base address derived from the loop index alone, so no
           pointer state is shared between threads. */
        __m128i *array = (__m128i *) myarray + i * N_LOOP2 * 8;
        for (int z = 0; z < N_LOOP2; z += 8)
        {
            array[z]   = _mm_and_si128(array[z],   mym128i);
            array[z+1] = _mm_and_si128(array[z+1], mym128i);
            array[z+2] = _mm_and_si128(array[z+2], mym128i);
            array[z+3] = _mm_and_si128(array[z+3], mym128i);
            array[z+4] = _mm_and_si128(array[z+4], mym128i);
            array[z+5] = _mm_and_si128(array[z+5], mym128i);
            array[z+6] = _mm_and_si128(array[z+6], mym128i);
            array[z+7] = _mm_and_si128(array[z+7], mym128i);
            array += 8;
        }
    }
}

BTW: why do you need manual loop unrolling here? The compiler should do that automatically.

Alexander Weggerle
  • You're right. With your modification, the cilk_for runs correctly. But it is slower than a normal for loop (about ~1.5x). – Piel Jun 02 '14 at 03:48
  • The manual unrolling is there because I thought it would amortize the overhead of cilk_for. Maybe I was wrong, because the running time of the unrolled version is the same as the normal version. If the compiler unrolls loops automatically, should I remove the manual unrolling? – Piel Jun 02 '14 at 03:56
  • Depending on the size of the array, you are limited by memory bandwidth rather than by the CPU. The slowdown with multiple threads might be because the threads access different memory locations while sharing the L3 cache. A VTune run would help you understand that. – Alexander Weggerle Jun 02 '14 at 09:35
  • If the compiler does the unrolling automatically, it's very easy to compile the code for different architectures like Intel® Xeon Phi™. If you unroll manually, you might need to adapt the (manual) unrolling in such cases. – Alexander Weggerle Jun 02 '14 at 09:43