How to improve forking/joining of multithreading program?

Question

Apparently the OP already got their answer in the comments, and the issue is now resolved.

I have coded a prime number program (sieve of eratosthenes) that executes using pthreads.

This is my first multithreaded program, and I don't know why it takes roughly 3 minutes to execute. That's too much time!

Can someone tell me where exactly I am going wrong:

#include<iostream>
#include<cstring>
#include<pthread.h>
#include<time.h>

using namespace std;

//set limits
#define  LIMIT   100000001
#define THREAD_LIMIT   8

//declare buffers
bool num[LIMIT];

unsigned long long num_of_prime = 1; // 2 is counted as prime initially 
unsigned long long sum_prime = 2;    // 2 is counted in sum of primes

void *search(void *);

int main()
{
    clock_t start_time = clock(); // start clock stamp

    pthread_t thread[THREAD_LIMIT];
    int thread_val=-1,j=-1;
    unsigned long long i=3;
    bool *max_prime[10];    // stores max. 10 prime numbers

    memset(num,0,LIMIT);    // initialize buffer with 0 

    while(i<LIMIT)
    {
        if(num[i]==0)
        {
            num_of_prime++;
            sum_prime +=i;
            j = ++j%10;
            max_prime[j]=num+i;
            thread_val=++thread_val%THREAD_LIMIT; 
            pthread_join(thread[thread_val],NULL);  // wait till the current thread ends
            pthread_create(&thread[thread_val],NULL,search,(void *)i); // fork thread function to flag composite numbers
        }   
        i+=2;   // only odd numbers
    }

    // end all threads
    for(i=0;i<THREAD_LIMIT;i++)
    {
        pthread_join(thread[i],NULL); 
    }

    cout<<"Execution time: "<<((double)(clock() - start_time))/CLOCKS_PER_SEC<<"\n";
    cout<<"Number of Primes: "<<num_of_prime<<"\n";
    cout<<"Sum of Primes: "<<sum_prime<<"\n";
    cout<<"List of 10 Max. Primes: "<<"\n";
    for(i=0;i<10;i++)
    {
        j=++j%10;
        cout<<(max_prime[j]-num)<<"\n";
    }
    return 0;
}

void *search(void *n)
{
    unsigned long long jump = (unsigned long long int)n;
    unsigned long long position = jump*jump; // start at N*N, the first multiple of N to flag
    bool *posn = num;

    jump<<=1; 
    while(position<LIMIT)
    {

        (*(posn+position))?(position+=jump):(*(posn+position)=1,position+=jump);

    } 
    return NULL;
}

Constraints: Only 8 threads can be forked.

N: 10^8

How can I improve the efficiency of this code (especially in forking & joining the threads)?

Just like that, your first `pthread_join()` call has a problem in the sense that it will join on a thread that may not yet be finished, when another thread may already be finished. The loop in the middle is probably okay, except that you're going to call join on threads that are not running anymore... Finally, your pthread is NOT initialized and yet you call `pthread_join()` on it... Rather bad if you ask me. — Alexis Wilke, Sep 09 '14 at 00:03
Instead of forking thread after thread, where you have to join before forking again when you are at the limit, why not redesign your solution so that you just fork the right number of threads, and let each do 1/8th of the total task? — jxh, Sep 09 '14 at 00:14
@jxh: it's hard to divide this task up without starting it, in particular, it starts each pass of the algorithm at the next available prime, and the value of that prime is only evident once all previous primes have finished being processed. Not that it isn't possible, but it's certainly not trivial. — Wug, Sep 09 '14 at 00:17
@Wug: Pipeline processing. Assume 8 threads, T[i]. Each thread is given 1/8th part of the sieve to work. T[0] starts on 2 until it reaches its end, then hands off to T[1], and then T[0] starts on 3. Thus T[0] is sieving 3 while T[1] is sieving 2. Eventually, when T[0] is sieving 19, all the threads will be busy sieving. That's just off the top of my head. — jxh, Sep 09 '14 at 00:30
Here's a mathematical optimization that will probably help you more than anything else: You only have to sieve up to sqrt(LIMIT), not all the way up to LIMIT. — Wug, Sep 09 '14 at 00:36
No, it's true in general. No number in your sieve of size LIMIT will have a factor greater than sqrt(LIMIT) unless it also has a factor less than sqrt(LIMIT). If a and b are your factors, and a and b are both > sqrt(LIMIT), then ab > LIMIT. Notice how this optimization is already employed in the search function, wherein marking composite numbers starts at `jump*jump` (which will be outside of LIMIT for any jump > sqrt(LIMIT)). — Wug, Sep 09 '14 at 01:26
To be honest, I don't know why you're using threads for this. Unless it's an assignment. Here's an ideone I threw together that sieves the primes up to one hundred million in less than ideone's 5 second time limit (Wolfram Alpha verifies that the quantity is correct, but I didn't print them all): http://ideone.com/mEHr2p — Wug, Sep 09 '14 at 01:55
@jxh: If LIMIT-1 is not a prime, it will have a factor less than or equal to sqrt(LIMIT - 1), which is strictly less than sqrt(LIMIT). — Wug, Sep 09 '14 at 03:56
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/60860/discussion-between-wug-and-jxh). — Wug, Sep 09 '14 at 04:00
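
The sqrt(LIMIT) point from the comments above can be sketched as a plain single-threaded, odd-only sieve. This is only an illustration, not the code behind the ideone link: `count_primes` is a hypothetical name, and the byte-per-number `std::vector<char>` buffer is an assumption chosen for simplicity.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the question): counts the primes <= limit.
std::uint64_t count_primes(std::uint64_t limit) {
    if (limit < 2) return 0;
    // composite[i] becomes 1 once i is known to be composite; only odd i are touched.
    std::vector<char> composite(limit + 1, 0);
    // The outer loop stops once p*p > limit: any composite <= limit has a
    // factor <= sqrt(limit), so larger p have nothing left to mark.
    for (std::uint64_t p = 3; p * p <= limit; p += 2) {
        if (composite[p]) continue;
        // Start at p*p and step by 2*p so only odd multiples are visited.
        for (std::uint64_t m = p * p; m <= limit; m += 2 * p)
            composite[m] = 1;
    }
    std::uint64_t count = 1;  // 2 is counted up front, as in the question
    for (std::uint64_t i = 3; i <= limit; i += 2)
        if (!composite[i]) ++count;
    return count;
}
```

Calling `count_primes(100000000)` should reproduce the 5,761,455 figure quoted further down, with no thread creation at all.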

score 0 · Answer 1 · answered Sep 12 '14 at 00:49

My experience is that throwing some threads at the problem speeds things up, but disappointingly little, for primes up to a large N.

I tried splitting up the sieve into chunks, one per thread. One thread generates a list of primes up to sqrt(N), then all threads crunch away at their part of the sieve, knocking out multiples of the primes. The idea was to have as little interaction between the threads as possible -- they all crunch away at their portion of the sieve, independently.
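
A minimal sketch of that chunk-per-thread layout, assuming C++11 `std::thread` rather than raw pthreads. It is not the C code I timed below; `base_primes`, `sieve_chunk` and `count_primes_chunked` are hypothetical names made up for the illustration. Each thread is created once and joined once, and writes only to its own range of the shared buffer.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: all primes <= root, found with a plain sieve.
static std::vector<std::uint64_t> base_primes(std::uint64_t root) {
    std::vector<char> composite(root + 1, 0);
    std::vector<std::uint64_t> primes;
    for (std::uint64_t p = 2; p <= root; ++p) {
        if (composite[p]) continue;
        primes.push_back(p);
        for (std::uint64_t m = p * p; m <= root; m += p) composite[m] = 1;
    }
    return primes;
}

// Worker: marks composites in [lo, hi) only, so threads never write the same byte.
static void sieve_chunk(std::vector<char>& composite,
                        const std::vector<std::uint64_t>& primes,
                        std::uint64_t lo, std::uint64_t hi) {
    for (std::uint64_t p : primes) {
        if (p * p >= hi) break;  // primes are sorted; nothing left to mark here
        // First multiple of p inside the chunk, but never below p*p.
        std::uint64_t first = std::max(p * p, (lo + p - 1) / p * p);
        for (std::uint64_t m = first; m < hi; m += p) composite[m] = 1;
    }
}

// Hypothetical entry point: counts primes <= limit using num_threads workers.
std::uint64_t count_primes_chunked(std::uint64_t limit, unsigned num_threads) {
    if (limit < 2) return 0;
    if (num_threads == 0) num_threads = 1;
    std::uint64_t root = 1;  // floor(sqrt(limit)), computed without floating point
    while ((root + 1) * (root + 1) <= limit) ++root;
    const std::vector<std::uint64_t> primes = base_primes(root);
    // One byte per number: vector<bool> packs bits, so neighbouring chunks would race.
    std::vector<char> composite(limit + 1, 0);
    const std::uint64_t span = (limit - 1 + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::uint64_t lo = 2 + t * span;
        const std::uint64_t hi = std::min(lo + span, limit + 1);
        if (lo >= hi) break;  // fewer numbers than threads
        workers.emplace_back(sieve_chunk, std::ref(composite), std::cref(primes), lo, hi);
    }
    for (std::thread& w : workers) w.join();  // each thread is joined exactly once
    std::uint64_t count = 0;
    for (std::uint64_t i = 2; i <= limit; ++i)
        if (!composite[i]) ++count;
    return count;
}
```

With the question's constraint this would be called as `count_primes_chunked(100000000, 8)`. On most toolchains it needs to be built with `-pthread`.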

Your code appears to start a new thread to knock out the multiples for each prime found. The overhead of starting/stopping that many threads fills me with dismay! I'm damned if I can see how you avoid having the threads trip over each other -- but I assume they don't?

FWIW, for primes up to 10^8, I manage:

unthreaded: 0.160 secs elapsed, 0.140 secs user
5 threads: 0.040 secs elapsed, 0.130 secs user.

on a relatively modest x86_64 machine.

For 10^10:

unthreaded: 39.260 secs elapsed, 37.910 secs user
5 threads: 23.680 secs elapsed, 110.120 secs user.

which is deeply disappointing. I think the problem is that the cache is being swamped... the code processes each prime in turn and knocks out all its multiples, so is sweeping from one end of a chunk to the other, and then going back to the beginning. It may actually be better to bang away at, say, 512K of the sieve, for all primes, and then repeat.
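
The "bang away at 512K for all primes, then repeat" idea might look roughly like the sketch below. I have not timed this version; `count_primes_blocked` is a hypothetical name, and the 512 KiB default is just the figure mentioned above, not a tuned value. It is single-threaded to keep the blocking visible; the same inner loop could be run per thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts primes <= limit, sieving one small block at a time
// so the working set stays cache-sized instead of sweeping the whole range per prime.
std::uint64_t count_primes_blocked(std::uint64_t limit,
                                   std::size_t block_size = 512 * 1024) {
    if (limit < 2) return 0;
    if (block_size == 0) block_size = 1;
    std::uint64_t root = 1;  // floor(sqrt(limit))
    while ((root + 1) * (root + 1) <= limit) ++root;
    // Base primes up to sqrt(limit), found with a plain sieve.
    std::vector<char> small(root + 1, 0);
    std::vector<std::uint64_t> primes;
    for (std::uint64_t p = 2; p <= root; ++p) {
        if (small[p]) continue;
        primes.push_back(p);
        for (std::uint64_t m = p * p; m <= root; m += p) small[m] = 1;
    }
    std::vector<char> block(block_size);  // reused for every block
    std::uint64_t count = 0;
    for (std::uint64_t lo = 2; lo <= limit; lo += block_size) {
        // The block covers [lo, hi); hi is exclusive.
        const std::uint64_t hi = std::min<std::uint64_t>(lo + block_size, limit + 1);
        std::fill(block.begin(), block.end(), 0);
        // Apply every base prime to this block before moving on to the next one.
        for (std::uint64_t p : primes) {
            if (p * p >= hi) break;
            std::uint64_t first = std::max(p * p, (lo + p - 1) / p * p);
            for (std::uint64_t m = first; m < hi; m += p) block[m - lo] = 1;
        }
        for (std::uint64_t i = lo; i < hi; ++i)
            if (!block[i - lo]) ++count;
    }
    return count;
}
```

The trade-off is one division per prime per block to find the starting multiple, in exchange for keeping all marking writes inside a buffer that should fit in cache.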

0.16 secs is not bad at all. I have a simple C++ on http://ideone.com/fapob, and assuming your machine is 3x faster, it'd take 0.26 secs. So you did one better. Was it wheels? — Will Ness, Sep 12 '14 at 13:36
My machine is AMD Phenom II X6 1090T (3200) so relatively modest... dunno how it compares. It's C, and a lot more of it than the C++ referred to. — , Sep 12 '14 at 19:31
@WillNess: I have a number that I already computed. I don't think 0.16 is a reasonable value for the hardware. I suspect a missing 0. — jxh, Sep 13 '14 at 05:08
@jxh it's entirely reasonable. have you looked into my ideone link? It ran for 0.68s for 10^8, and Ideone is usually 3x slower than any common box today. — Will Ness, Sep 13 '14 at 05:12
@jxh: for 10^8 I found 5,761,455, as expected. When you say 0.16 is not a reasonable value, do you think it should be 1.6 or 0.016 ? — , Sep 13 '14 at 10:11
@gmch: Nevermind, I see how my implementation would be slow compared to what Will Ness did. — jxh, Sep 13 '14 at 15:36
