On Linux GCC/pthread parallel code is much slower than simple single thread code

Question

I am testing pthread parallel code on Linux with gcc (GCC) 4.8.3 20140911, on a CentOS 7 Server.

The single-threaded version is simple; it initializes a 10000 * 10000 matrix:

#include <stdlib.h>

int main(void)
{
    int size = 10000;

    int *r = (int*)malloc(size * size * sizeof(int));
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            r[i * size + j] = rand();
        }
    }
    free(r);
    return 0;
}

Then I wanted to see if parallel code can improve the performance:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int size = 10000;

/* Despite its name, this fills the even-numbered rows (i = 0, 2, 4, ...). */
void *SetOdd(void *param)
{
   printf("Enter odd\n"); 
   int * r      = (int*)param;
   for (int i=0; i<size; i+=2) {
         for (int j=0; j<size; j++) {
                r[i * size + j] = rand();
         }
   }
   printf("Exit Odd\n");
   pthread_exit(NULL);
   return 0;
} 

/* Despite its name, this fills the odd-numbered rows (i = 1, 3, 5, ...). */
void *SetEven(void *param)
{ 
   printf("Enter Even\n");
   int * r      = (int*)param;
   for (int i=1; i<size; i+=2) {
        for (int j=0; j<size; j++) {
                r[i * size + j] = rand();
        }
   }
   printf("Exit Even\n");
   pthread_exit(NULL);
   return 0;
} 

int main(void)
{
     printf("running in thread\n");
     pthread_t threads[2];
     int * r = (int*)malloc(size * size * sizeof(int));
     int rc0 = pthread_create(&threads[0], NULL, SetOdd, (void *)r); 
     int rc1 = pthread_create(&threads[1], NULL, SetEven, (void *)r); 
     for (int t = 0; t < 2; t++) {
         void *status;
         int rc = pthread_join(threads[t], &status);
         if (rc) {
             printf("ERROR; return code from pthread_join()   is %d\n", rc);
             exit(-1);
         }
         printf("Completed join with thread %d status= %ld\n", t, (long)status);
     }

   free(r);
   return 0;
}

The simple code runs for about 0.8 seconds, while the multi-threaded version runs for about 10 seconds!

I am running on a 4-core server. Why is the multi-threaded version so slow?

The code is likely blocking on the mutex inside `rand()`, since it guarantees a certain sequence of produced numbers. You need to learn to use profilers (e.g. gprof) to actually identify the bottlenecks. — Dummy00001, Dec 02 '15 at 10:18
Both `valgrind --tool=callgrind` and `gprof` (on a static build of the application) clearly show where the bottleneck is. And it is indeed in `rand()`. Cheers. — Dummy00001, Dec 02 '15 at 10:33

P.P · Answer 1 · 2015-12-02T11:19:36.857

rand() is not required to be thread-safe or re-entrant, so you shouldn't rely on rand() in multi-threaded applications.

Use rand_r() instead. It is also a pseudo-random generator, and it is thread-safe because it keeps its state in a seed variable supplied by the caller. On my system with 2 cores, using rand_r() brings the execution time of your code down to roughly half that of the single-threaded version.

In both of your thread functions, do:

void *SetOdd(void *param)
{
   printf("Enter odd\n");
   unsigned int s = (unsigned int)time(0);  /* per-thread state for rand_r(); needs #include <time.h> */

   int * r      = (int*)param;
   for (int i=0; i<size; i+=2) {
         for (int j=0; j<size; j++) {
                r[i * size + j] = rand_r(&s);
         }
   }
   printf("Exit Odd\n");
   pthread_exit(NULL);
   return 0;
}

Update:

While the C and POSIX standards do not mandate that rand() be thread-safe, the glibc implementation (used on Linux) actually does implement it in a thread-safe manner.

If we look at the glibc implementation of the rand(), there's a lock:

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

Any synchronization construct (mutex, condition variable, etc.) has a performance cost, so the fewer such constructs the code uses, the better it performs (of course, we can't avoid them completely in multi-threaded applications).

So only one thread at a time can actually access the random number generator, and both threads are fighting for the lock all the time. This explains why rand() leads to poor performance in multi-threaded code.

Using `time()` as the random seed in a multi-threaded application is inadvisable. Add the thread ID (`pthread_self()`), or the address of a stack variable, or use the nanosecond portion of the time from `clock_gettime()` to randomize it. — Dummy00001, Dec 02 '15 at 10:50
I agree it might not produce the desired results. I used it as a "cheap" replacement for the thread safety issue. I'll update the answer later. If the quality of the random numbers matters, then `drand48_r()` and friends can be used. `pthread_self()` might not be a good idea since `pthread_t` is an opaque type and can't be relied upon to have an integer representation of some sort. At least on Linux `gettid()` can be used instead for the same effect. — P.P, Dec 02 '15 at 10:57
It is merely a pedantic comment to indicate that the code as it is can't be relied upon in production, so that multi-threading beginners do not get any silly ideas ;) — Dummy00001, Dec 02 '15 at 11:17
@Dummy00001 But it's a valid point though. I don't mind it at all ;) — P.P, Dec 02 '15 at 11:39

score 3 · Answer 2 · answered Dec 02 '15 at 10:45

The rand() function is designed to produce a predictable sequence of random numbers (the seed of the sequence can be controlled with the srand() function). That implies that the function has internal state, in all likelihood protected by a mutex.

The presence of the lock can be confirmed with tools such as gprof or valgrind --tool=callgrind. (For gprof to detect problems related to the standard library, you would need to compile/link the application with -static.)

In single-threaded mode, the mutex is inactive. But in multi-threaded mode, the mutex causes permanent collisions and stalls of the threads, both fighting to acquire the same lock in a tight loop. That severely degrades the multi-threaded performance.
