
I am finding all primes using the Sieve of Eratosthenes algorithm. I attempted to parallelize this algorithm. However, speedup stops increasing after two threads!

[Speedup graph]

My code is essentially two nested loops. The outer for loop increments its counter `k` by a constant (2). Inside it, the inner loop increments its own counter by 2*k, and a trailing while loop then advances `k` by an amount that is not known in advance.

    #pragma omp parallel for
    for (int k = 3; k <= sqrt_size; k = k + 2) {
        for (int curPrime = k * k; curPrime <= size; curPrime += 2 * k)
            marked[curPrime / 2] = 1;
        while (marked[k / 2]) k += 2; // Find the next unmarked value (advances by an unknown amount)
    }

I think my lack of speedup past two threads is due to this unknown amount being added to the counter variable. Should I make the counter shared, and then place the increment (or the entire while loop) inside a critical section?

Casey
  • Your code contains a race condition on `marked` as some thread can write into it (inner for loop) while others can read it (inner while loop). I think you cannot parallelize the outer loop that way. The final result is probably currently wrong. – Jérôme Richard Oct 17 '20 at 23:24
  • Here's another answer for the same problem/issue: https://stackoverflow.com/questions/55634587/failed-performance-improvement-in-the-nested-for-loop-in-openmp/55635962#55635962 – Craig Estey Oct 18 '20 at 01:52
  • @CraigEstey In the code surrounding this snippet, I already implemented the suggestions posted in that answer. – Casey Oct 18 '20 at 20:29
  • @JérômeRichard how would I resolve this race condition? I found that parallelizing the inner for loop is most significant, so I can remove the outer for loop's pragma. Would I then just need to make the while loop a critical section? – Casey Oct 18 '20 at 20:31
  • Moving the `#pragma omp parallel for` to just before the `curPrime` loop is fine. In that case you do not need a critical section, since the while loop is outside the OpenMP region. The resulting code may not be efficient, but it will at least be correct. I am curious about the resulting performance. – Jérôme Richard Oct 18 '20 at 20:52
  • With just the parallel for before the inner for loop, I get a speedup of ~60% for a problem size of 1.5 billion. With both for loops parallelized, it's about the same as just the outer loop parallelized. I am guessing it is due to the race condition you mentioned. – Casey Oct 19 '20 at 02:06

0 Answers