I am not 100% sure this will be much better, but you could try the following:
int count = 0;
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    if (M[i]) {
        #pragma omp atomic
        T[count++] = i;
    }
}
If the array is quite sparse, threads will be able to zip through a lot of zeros without waiting for others. But you can only update one index at a time. The problem is really that different threads are writing to the same memory block (T), which means you will be running into issues of caching: every time one thread writes to T, the corresponding cache line on all the other cores is marked "dirty" - so when they try to modify it, a lot of shuffling goes on behind the scenes. All this is transparent to you (you don't need to write code to handle it) but it slows things down significantly - I suspect that's your real bottleneck. If your matrix is big enough to make it worth your while, you might try to do the following:
- Create as many arrays T as there are threads
- Let each thread update its own version of T
- Combine all the T arrays into one after the loops have completed
It might be faster (because the different threads don't write to the same memory) - but since there are so few statements inside the loop, I suspect it won't be.
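For what it's worth, a rough sketch of that per-thread-buffer idea could look like the following (localT and localCount are names I'm introducing just for illustration; note that the combined T will generally not end up in ascending index order, because threads append their chunks in whatever order they reach the critical section):

/* Sketch only - assumes M, N and T as in the question, and that T is large
   enough to hold every matching index. Needs <stdlib.h> and <string.h>. */
int count = 0;
#pragma omp parallel
{
    int *localT = malloc(N * sizeof *localT);   /* private buffer per thread */
    int localCount = 0;

    #pragma omp for nowait
    for (int i = 0; i < N; ++i) {
        if (M[i]) {
            localT[localCount++] = i;           /* no sharing, no contention */
        }
    }

    /* Append this thread's chunk to the shared array, one thread at a time */
    #pragma omp critical
    {
        memcpy(T + count, localT, localCount * sizeof *localT);
        count += localCount;
    }
    free(localT);
}

Sizing each localT to N is wasteful but keeps the sketch simple; in practice you would size or reuse the buffers more carefully.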
EDIT I created a complete test program, and found two things. First, the atomic directive doesn't work in all versions of omp, and you may well have to use T[count++] += i; for it to even compile (which is OK since T can be set to all zeros initially). More troubling, you will not get the same answer twice if you do this (the final value of count changes from one pass to the next); if you use critical, that doesn't happen.
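For reference, a critical-based variant of that loop would look something like this (it gives a consistent count, though it still serializes every single update):

int count = 0;
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    if (M[i]) {
        #pragma omp critical
        {
            T[count++] = i;   /* only one thread at a time touches count and T */
        }
    }
}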
A second observation is that the speed of the program really slows down when you increase the number of threads, which confirms what I was suspecting about shared memory. Times for 10M elements processed:
threads   elapsed
      1     0.09s
      2     0.73s
      3     1.21s
      4     1.62s
      5     2.34s
You can see this is true by changing how sparse the matrix M is - when I create M as a random array and test for M[i] < 0.01 * RAND_MAX (roughly a 1% dense matrix), things run much more quickly than if I make it 10% dense - showing that the part inside the critical section is really slowing us down.
That being the case, I don't think there is a way of speeding up this task in OMP - the job of consolidating the outputs of all the threads into a single list at the end is just going to eat up any speed advantage you may have had, given how little is going on inside the inner loop. So rather than using multiple threads, I suggest you rewrite the loop as efficiently as possible - for example:
int count = 0;
for (int i = 0; i < N; i++) {
    T[count] = i;      /* write the index unconditionally...            */
    count += M[i];     /* ...but only advance the cursor when M[i] == 1 */
}
In my quick benchmark, this was faster than the OMP solution - comparable with the threads = 1 solution. Again - this is because of the way memory is being accessed here. Note that I avoid using an if statement - this sidesteps branch mispredictions and keeps the code as fast as possible. Instead, I take advantage of the fact that M[i] is always zero or one. At the end of the loop you have to discard the element T[count] because it will be invalid... the "good elements" are T[0] ... T[count-1]. An array M with 10M elements was processed by this loop in ~0.02 sec on my machine. Should be sufficient for most purposes?