My multithreaded C program runs the following routine:

#include <pthread.h>
#include <stdio.h>

#define NUM_LOOP 500000000
long long sum = 0;

void* add_offset(void *n){
        int offset = *(int*)n;
        for(int i = 0; i<NUM_LOOP; i++) sum += offset;
        pthread_exit(NULL);
}

Of course, sum should be updated while holding a lock, but before getting to that I have an issue with the running time of this simple program.

When the main function is (single thread):

int main(void){

        pthread_t tid1;
        int offset1 = 1;
        pthread_create(&tid1,NULL,add_offset,&offset1);
        pthread_join(tid1,NULL);
        printf("sum = %lld\n",sum); 
        return 0;
}

The output and running time are:

sum = 500000000

real    0m0.686s
user    0m0.680s
sys     0m0.000s

When the main function is (multithreaded, sequential):

int main(void){

        pthread_t tid1;
        int offset1 = 1;
        pthread_create(&tid1,NULL,add_offset,&offset1);
        pthread_join(tid1,NULL);

        pthread_t tid2;
        int offset2 = -1;
        pthread_create(&tid2,NULL,add_offset,&offset2);
        pthread_join(tid2,NULL);

        printf("sum = %lld\n",sum);

        return 0;
}

The output and running time are:

sum = 0

real    0m1.362s
user    0m1.356s
sys     0m0.000s

So far the program runs as expected. But when the main function is (multithreaded, concurrent):

int main(void){

        pthread_t tid1;
        int offset1 = 1;
        pthread_create(&tid1,NULL,add_offset,&offset1);

        pthread_t tid2;
        int offset2 = -1;
        pthread_create(&tid2,NULL,add_offset,&offset2);

        pthread_join(tid1,NULL);
        pthread_join(tid2,NULL);

        printf("sum = %lld\n",sum);

        return 0;
}

The output and running time are:

sum = 166845932

real    0m2.087s
user    0m3.876s
sys     0m0.004s

The erroneous value of sum, caused by the lack of synchronization, is not the issue here; the running time is. The concurrent version takes longer in wall-clock time (2.087 s) than the sequential one (1.362 s), and it uses almost three times the CPU time (3.876 s versus 1.356 s of user time). This is the opposite of what I would expect from concurrent execution on a multicore CPU.

Please explain what might be the problem here.

Aroonalok
  • Have you looked at the assembler for the three variations? Is there a difference you can't account for? Have you used the same optimization everywhere? How much optimization did you use? Which compiler did you use, on which version/variant of Linux? – Jonathan Leffler Jun 18 '17 at 07:16
  • 1) gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609 2) I simply compiled it with gcc -std=c99 -pthread 3) I am using Ubuntu 16.04.2 with kernel : 4.4.0-79-generic – Aroonalok Jun 18 '17 at 07:19

2 Answers

This is a fairly common effect when multiple threads repeatedly write to the same shared state (at least on x86). It is usually called cache line ping-pong:

Every time one core wants to update the value of that variable, it first has to take "ownership" of the cache line (lock the cache line for writing) from the other core, which takes some time. Then the other core wants the cache line back, and so the line keeps bouncing between the two cores.

So even without a synchronization primitive you are paying a significant overhead compared to the sequential case.
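
As a minimal sketch of how the ping-pong can be avoided (this is not the asker's code), each thread below gets its own counter, padded so that the two counters cannot sit on the same cache line. The names padded_counter, worker_arg, add_offset_private and run_private_counters are made up for this illustration, and the 64-byte line size is an assumption that holds on typical x86 parts but is not guaranteed.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_LOOP 500000000
#define CACHE_LINE 64   /* assumed cache-line size; typical on x86, not guaranteed */

/* One counter per thread. The padding makes each element CACHE_LINE bytes
   long, so neighbouring counters never share a line of that size. */
struct padded_counter {
        long long value;
        char pad[CACHE_LINE - sizeof(long long)];
};

/* Hypothetical argument bundle: which counter to update and by how much. */
struct worker_arg {
        struct padded_counter *counter;
        int offset;
};

static void *add_offset_private(void *p){
        struct worker_arg *arg = p;
        /* Same loop as in the question, but the target is private to this thread. */
        for(int i = 0; i<NUM_LOOP; i++) arg->counter->value += arg->offset;
        return NULL;
}

/* Runs both threads concurrently and returns the combined total. */
long long run_private_counters(void){
        static struct padded_counter counters[2];
        counters[0].value = 0;
        counters[1].value = 0;

        struct worker_arg args[2] = { {&counters[0], 1}, {&counters[1], -1} };
        pthread_t tid[2];

        for(int t = 0; t<2; t++){
                int rc = pthread_create(&tid[t],NULL,add_offset_private,&args[t]);
                assert(rc == 0);
                (void)rc;
        }
        for(int t = 0; t<2; t++) pthread_join(tid[t],NULL);

        /* Safe to read here: both writers have been joined. */
        return counters[0].value + counters[1].value;
}
```

Calling run_private_counters() from main and printing the result should give sum = 0. Because no cache line is written by both cores, the wall-clock time should be close to the single-thread case rather than worse than the sequential one, though the exact numbers depend on the machine and compiler flags.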

MikeMB
  • Thanks for the link to cache-line ping-pong – Aroonalok Jun 18 '17 at 07:44
  • @Aroonalok: more broadly than just this example, shared state is generally expensive and should be avoided. You could avoid that cost, for instance, by having each thread compute a sum locally and only update the shared state at the end. – spectras Jun 18 '17 at 13:20

As suggested by @spectras, I made the following changes to the add_offset routine:

#define NUM_LOOP 500000000
long long sum = 0;

void* add_offset(void *n){
        int offset = *(int*)n;
        long long sum_local = sum; //read sum
        for(int i = 0; i<NUM_LOOP; i++) sum_local += offset;
        sum = sum_local; //write to sum
        pthread_exit(NULL);
}

With the main function for the multithreaded concurrent case unchanged from above, the runtime is now as expected:

sum = 500000000

real    0m0.683s
user    0m1.356s
sys     0m0.000s

On another run, the output and runtime are:

sum = -500000000

real    0m0.686s
user    0m1.360s
sys     0m0.000s

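These two values, and only these two, are expected because the threads are not synchronized: each thread reads sum at the start (still 0, since neither thread has finished yet) and overwrites it at the end. The value of sum in the output therefore reflects which thread (offset=1 or offset=-1) wrote sum last.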
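
To get the correct total instead of a last-writer-wins result, one option is to start each local sum at 0 and add it to sum under a mutex once the loop is done. The following is only a sketch of that idea. The names sum_lock, add_offset_locked and run_locked_total are made up for this illustration, and I have not timed it against the version above.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_LOOP 500000000

static long long sum = 0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_offset_locked(void *n){
        int offset = *(int*)n;
        long long sum_local = 0;   /* start from 0 instead of reading sum */
        for(int i = 0; i<NUM_LOOP; i++) sum_local += offset;

        /* Touch the shared variable exactly once, while holding the lock. */
        pthread_mutex_lock(&sum_lock);
        sum += sum_local;
        pthread_mutex_unlock(&sum_lock);
        return NULL;
}

/* Runs both threads concurrently and returns the final value of sum. */
long long run_locked_total(void){
        int offsets[2] = {1, -1};
        pthread_t tid[2];

        sum = 0;
        for(int t = 0; t<2; t++){
                int rc = pthread_create(&tid[t],NULL,add_offset_locked,&offsets[t]);
                assert(rc == 0);
                (void)rc;
        }
        for(int t = 0; t<2; t++) pthread_join(tid[t],NULL);

        /* Both threads have been joined, so reading sum needs no lock here. */
        return sum;
}
```

Calling run_locked_total() from main and printing the result should give sum = 0 on every run. Each thread takes the lock only once, so the locking cost should be negligible next to the loop itself.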

Aroonalok