
I am an OpenMP beginner and have come across the following problem.

I have a mask array M of length N, whose elements are either 0 or 1. I want to extract all indices i that satisfy M[i] == 1 and store them in a new array T. For example, if M = {0, 1, 1, 0, 1}, the result should be T = {1, 2, 4}.

Can this problem be accelerated by OpenMP?

I have tried the following code, but its performance is poor.

int count = 0;
#pragma omp parallel for
for(int i = 0; i < N; ++i) {
    if(M[i] == 1) {
        int pos;
        #pragma omp critical (c1)
        pos = count++;
        T[pos] = i;
    }
}
konjac

3 Answers


I am not 100% sure this will be much better, but you could try the following:

int count = 0;
#pragma omp parallel for
for(int i = 0; i < N; ++i) {
    if(M[i]) {
        #pragma omp atomic
        T[count++] = i;
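        // (as discussed in the EDIT below, this form of atomic may not compile)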
    }
}

If the array is quite sparse, threads will be able to zip through a lot of zeros without waiting for others. But you can only update one index at a time. The problem is really that different threads are writing to the same memory block (T), which means you will be running into caching issues: every time one thread writes to T, the copy of that cache line in all the other cores is invalidated - so when they try to modify it, a lot of shuffling goes on behind the scenes. All this is transparent to you (you don't need to write code to handle it), but it slows things down significantly - I suspect that's your real bottleneck. If your matrix is big enough to make it worth your while, you might try the following:

  1. Create as many arrays T as there are threads
  2. Let each thread update its own version of T
  3. Combine all the T arrays into one after the loops have completed

It might be faster (because the different threads don't write to the same memory) - but since there are so few statements inside the loop, I suspect it won't be.
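For concreteness, here is roughly what steps 1-3 might look like (just a sketch, not benchmarked; extract_indices is a name I made up, and note that the combined T is no longer sorted, since threads append their chunks in whatever order they reach the critical section):

#include <omp.h>
#include <vector>

// hypothetical helper: returns the number of indices written to T
int extract_indices(const int *M, int *T, int N) {
    int count = 0;
    #pragma omp parallel
    {
        std::vector<int> Tlocal;               // step 1: one private array per thread
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)            // step 2: each thread fills its own copy
            if (M[i]) Tlocal.push_back(i);
        #pragma omp critical                   // step 3: combine into the shared T
        {
            for (int j = 0; j < (int)Tlocal.size(); ++j)
                T[count++] = Tlocal[j];
        }
    }
    return count;
}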

EDIT I created a complete test program and found two things. First, the atomic directive doesn't work in all versions of omp, and you may well have to use T[count++] += i; for it to even compile (which is OK, since T can be set to all zeros initially). More troubling, you will not get the same answer twice if you do this (the final value of count changes from one run to the next); if you use critical, that doesn't happen.
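In other words, the variant that compiled for me is:

#pragma omp atomic
T[count++] += i;   // only the += on the array element is atomic; the count++
                   // in the subscript is unprotected, hence the varying results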

A second observation is that the speed of the program really slows down when you increase the number of threads, which confirms what I was suspecting about shared memory. Times for 10M elements processed:

threads  elapsed
  1        0.09s
  2        0.73s
  3        1.21s
  4        1.62s
  5        2.34s

You can see this is true by changing how sparse the matrix M is - when I create M as a random array and test for M[i] < 0.01 * RAND_MAX (0.1% dense matrix), things run much more quickly than if I make it 10% dense - showing that the part inside the critical section is really slowing us down.

That being the case, I don't think there is a way of speeding up this task in OMP - the job of consolidating the outputs of all the threads into a single list at the end is just going to eat up any speed advantage you may have had, given how little is going on inside the inner loop. So rather than using multiple threads, I suggest you rewrite the loop as efficiently as possible - for example:

int count = 0;
for(int i = 0; i < N; i++) {
  T[count] = i;
  count += M[i];
}

In my quick benchmark, this was faster than the OMP solution - comparable with the threads = 1 solution. Again - this is because of the way memory is being accessed here. Note that I avoid using an if statement - this keeps the code as fast as possible. Instead, I take advantage of the fact that M[i] is always zero or one. At the end of the loop you have to discard the element T[count] because it will be invalid... the "good elements" are T[0] ... T[count-1]. An array M with 10M elements was processed by this loop in ~ 0.02 sec on my machine. Should be sufficient for most purposes?

Floris
  • I tried your code as it looks like a good synchronization example, but it doesn't compile with VS'12 and GCC 4.7. The compiler error message is: `Expression following '#pragma omp atomic' has improper form`! Do you have an idea? – alexbuisson Aug 07 '13 at 19:29
  • Yes - `atomic` needs `+=` with gcc 4.7. I discovered that when I started running the code myself. See updated answer. – Floris Aug 07 '13 at 19:40
  • +1, also found that on MSDN; in fact the allowed operations recall a bit the Win32 API (InterlockedIncrement, etc ...). Thanks – alexbuisson Aug 07 '13 at 19:54
  • I think the problem you have with the atomic (getting a different count for every run) is because only the assignment is made atomic, not the increment of count. According to the OpenMP spec you should use `#pragma omp atomic capture\n {structured block}` – ees_cu Aug 08 '13 at 11:54
  • @ees_cu thanks for the pointer. I will look into that. I fear it won't get around the problem of memory access conflicts, but at least it should give the right answer... – Floris Aug 08 '13 at 12:35
  • @Floris, you can create an array with size N*num_threads and fill that array in parallel with your fast function. I already tried it and it's 2-3 times as fast on my system as your fast function. It wastes a lot of space but it is faster. I can post the code if you're interested. – Z boson Aug 15 '13 at 14:36
  • @redrum - I would be interested to see. I wonder how you combine the outputs of the parallel loops in a way that doesn't end up taking more time. I thought perhaps a `memcpy` at the end of each loop might do it. I believe the sub arrays can be N/numthreads in size each (since you consider at most that many elements). Looking forward to seeing what you did! – Floris Aug 15 '13 at 14:42
  • @Floris, I did not combine the output. Well I tried it and the result was a wash. I'm not sure it needs to be combined. I think you're correct that the array size can be N/numthreads. – Z boson Aug 15 '13 at 14:57

Based on Floris's fast function, I tried to see if I could find a faster solution with OpenMP. I came up with two functions, foo_v2 and foo_v3, which are faster for larger arrays: foo_v2 is faster independent of density, and foo_v3 is faster for sparser arrays. The function foo_v2 essentially treats T as a 2D array with nthreads rows, each N/nthreads wide, along with an array counta which contains the count for each thread. This is better explained with code. The following code loops over all the elements written out to T:

for(int ithread=0; ithread<nthreads; ithread++) {
    for(int i=0; i<counta[ithread]; i++) {
        // the i-th index found by thread ithread:
        int index = T[ithread*N/nthreads + i];
    }
}

The function foo_v3 creates a 1D array as requested. In all cases N has to be pretty large to overcome the OpenMP overhead. The code below defaults to 256MB, with a density of M of about 10%. The OpenMP functions are both faster by over a factor of 2 on my 4 core Sandy Bridge system. If you put the density at 50%, foo_v2 is still faster by about a factor of 2, but foo_v3 is no longer faster.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int foo_v1(int *M, int *T, const int N) {   
    int count = 0;
    for(int i = 0; i<N; i++) {
        T[count] = i;
        count += M[i];
    }
    return count;
}

int foo_v2(int *M, int *T, int *&counta, const int N) {
    int nthreads;
    #pragma omp parallel
    {
        nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();   
        #pragma omp single
        counta = new int[nthreads];
        int count_private = 0;      
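        // each thread fills its own N/nthreads-wide slice of T
        // (assumes the default static schedule hands out contiguous chunks)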
        #pragma omp for
        for(int i = 0; i<N; i++) {
            T[ithread*N/nthreads + count_private] = i;
            count_private += M[i];
        }
        counta[ithread] = count_private;
    }
    return nthreads;
}

int foo_v3(int *M, int *T, const int N) {
    int count = 0;

    int *counta = 0;    
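    // strategy: each thread collects its hits in a private buffer, then
    // copies that buffer into T at an offset given by a prefix sum of
    // the per-thread counts stored in counta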
    #pragma omp parallel reduction(+:count)
    {
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();   
        #pragma omp single
        {
            counta = new int[nthreads+1];
            counta[0] = 0;
        }
        int *Tprivate = new int[N/nthreads];

        int count_private = 0;      
        #pragma omp for nowait
        for(int i = 0; i<N; i++) {
            Tprivate[count_private] = i;
            count_private += M[i];
        }
        counta[ithread+1] = count_private;
        count += count_private;

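        // wait until every thread has published its count, then compute this
        // thread's offset into T: the sum of the counts of threads 0..ithread-1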
        #pragma omp barrier
        int offset = 0;
        for(int i=0; i<(ithread+1); i++) {
            offset += counta[i];
        }

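        // copy the private buffer into its final position in T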
        for(int i=0; i<count_private; i++) {
            T[offset + i] = Tprivate[i];
        }

        delete[] Tprivate;
    }
    delete[] counta;
    return count;
}

void compare(const int *T1, const int *T2, const int N, const int count, const int *counta, const int nthreads) {
    int diff = 0;
    int n = 0;
    for(int ithread=0; ithread<nthreads; ithread++) {
        for(int i=0; i<counta[ithread]; i++) {
            int i2 = N*ithread/nthreads+i;
            //printf("%d %d\n", T1[n], T2[i2]);
            int tmp = T1[n++] - T2[i2];
            if(tmp<0) tmp*=-1;
            diff += tmp;
        }
    }
    printf("diff %d\n", diff);
}

void compare_v2(const int *T1, const int *T2, const int count) {
    int diff = 0;
    int n = 0;
    for(int i=0; i<count; i++) {
        int tmp = T1[i] - T2[i];
        //if(tmp!=0) printf("%i %d %d\n", i, T1[i], T2[i]);
        if(tmp<0) tmp*=-1;
        diff += tmp;
    }
    printf("diff %d\n", diff);
}

int main() {
    const int N = 1 << 26;

    printf("%f MB\n", 4.0*N/1024/1024);
    int *M = new int[N];
    int *T1 = new int[N];
    int *T2 = new int[N];
    int *T3 = new int[N];

    int *counta;
    double dtime;
    for(int i=0; i<N; i++) {
        M[i] = ((rand()%10)==0);
    }

    //int repeat = 10000;
    int repeat = 1;
    int count1, count2;
    int nthreads;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) count1 = foo_v1(M, T1, N);
    dtime = omp_get_wtime() - dtime;
    printf("time v1 %f\n", dtime);

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) nthreads = foo_v2(M, T2, counta, N);
    dtime = omp_get_wtime() - dtime;
    printf("time v2 %f\n", dtime);
    compare(T1, T2, N, count1, counta, nthreads);

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) count2 = foo_v3(M, T3, N);
    dtime = omp_get_wtime() - dtime;
    printf("time v2 %f\n", dtime);
    printf("count1 %d, count2 %d\n", count1, count2);
    compare_v2(T1, T3, count1); 

}
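
Note that this has to be compiled as C++ (foo_v2 takes its counta argument by reference) with OpenMP enabled, e.g. g++ -O3 -fopenmp.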
Z boson

The operation should be atomic rather than guarded by critical; actually, in your case, you have to use the capture clause of atomic:

int pos, count = 0;                     // pos declared outside the loop
#pragma omp parallel for private(pos)   // and privatized, count is implicitly
for(int i = 0; i < N; ++i) {            // shared by all the threads
    if(M[i]) {
        #pragma omp atomic capture
            pos = count++;
        T[pos] = i;
    }
}

Take a look at this answer for an overview of the atomic operations possible with OpenMP.

Kyle_the_hacker
  • @VladimirF: The `capture` clause was introduced in OpenMP 3.1; you need a compliant compiler: GCC since v4.7, Intel C++ Compiler since v10.1, Oracle Solaris Studio compiler since v12.3, IBM XL C/C++ since v12.1 (or earlier, couldn't find any earlier document), probably most other compilers **except** any MSVC compiler (including 2013 Preview), which don't support OpenMP versions higher than 2.0 (released in 2000)... – Kyle_the_hacker Aug 08 '13 at 15:21