
I am trying to learn pthreads and I have been experimenting with a program that tries to detect changes on an array. The function array_modifier() picks a random element, toggles its value (1 to 0 and vice versa) and then sleeps for some time (big enough so race conditions do not appear; I know this is bad practice). change_detector() scans the array; when an element doesn't match its prior value and is equal to 1, the change is detected and the diff array is updated with the detection delay.

When there is one change_detector() thread (NTHREADS == 1) it has to scan the whole array. When there are more threads, each is assigned a portion of the array. Each detector thread only catches the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.

Here is the code:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define TIME_INTERVAL 100
#define CHANGES 5000

#define UNUSED(x) ((void) x)

typedef struct {
    unsigned int tid;
} parm;

static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;

static unsigned long int diff[NTHREADS] = {0}; /* NTHREADS is defined via -DNTHREADS on the compiler command line */

void* array_modifier(void* args);
void* change_detector(void* arg);

int main(int argc, char** argv) {
    if (argc < 2) {
        exit(1);
    }

    N = (unsigned int)strtoul(argv[1], NULL, 0);

    my_array = calloc(N, sizeof(int));
    time_array = malloc(N * sizeof(struct timeval));
    old_value = calloc(N, sizeof(int));

    parm* p = malloc(NTHREADS * sizeof(parm));
    pthread_t generator_thread;
    pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));

    for (unsigned int i = 0; i < NTHREADS; i++) {
        p[i].tid = i;
        pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
    }

    pthread_create(&generator_thread, NULL, array_modifier, NULL);

    pthread_join(generator_thread, NULL);

    usleep(500);

    for (unsigned int i = 0; i < NTHREADS; i++) {
        pthread_cancel(detector_thread[i]);
    }

    for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
    fprintf(stderr, "\n");
    _exit(0);
}


void* array_modifier(void* arg) {
    UNUSED(arg);
    srand(time(NULL));

    unsigned int changing_signals = CHANGES;

    while (changing_signals--) {
        usleep(TIME_INTERVAL);
        const unsigned int r = rand() % N;

        gettimeofday(&time_array[r], NULL);
        my_array[r] ^= 1;
    }

    pthread_exit(NULL);
}

void* change_detector(void* arg) {
    const parm* p = (parm*) arg;
    const unsigned int tid = p->tid;
    const unsigned int start = tid * (N / NTHREADS) +
                               (tid < N % NTHREADS ? tid : N % NTHREADS);
    const unsigned int end = start + (N / NTHREADS) +
                             (tid < N % NTHREADS);
    unsigned int r = start;

    while (1) {
        unsigned int tmp;
        while ((tmp = my_array[r]) == old_value[r]) {
            r = (r < end - 1) ? r + 1 : start;
        }

        old_value[r] = tmp;
        if (tmp) {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            // detection time in usec
            diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
        }
    }
}

when I compile & run like this:

gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100

I get:

665

but when I compile & run like this:

gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100

I get:

152 190 164 242

(this sums up to 748).

So, the delay for the multithreaded program is larger.

My cpu has 6 cores.

Klas Lindbäck
Giannis M.
    Can you possibly post less code? Do we really need to understand all this to be able to help? – meaning-matters Aug 14 '15 at 08:42
    I read this as one thread takes 665 time units, and the multithreaded one is finished after 242 times. Isn't that better? – Bo Persson Aug 14 '15 at 08:44
  • @meaning-matters sorry, I tried to minimize the code as much as possible. Most of the stuff at the beginning is there just to make the code work; the most important parts are the two functions – Giannis M. Aug 14 '15 at 08:49
    @BoPersson No. `diff` is measured in microseconds. For example, if I have 1 thread and change elements r=0 and r=1, each taking 30 usec to detect, `diff[0]` will be `60`. If I have 2 threads and change elements r=0 and r=1, each taking 40 usec, `diff` will be `40 40`. So the 2-threaded version is less responsive. What I care about is the sum of `diff` – Giannis M. Aug 14 '15 at 08:55
  • @GiannisM. - Creating threads isn't free, and they likely compete for resources when they run. I still think going from 60 us to 40 us is an improvement - the program will finish its work sooner. – Bo Persson Aug 14 '15 at 09:01
  • @BoPersson No, it will not; in both cases it will finish after approximately `CHANGES * TIME_INTERVAL` microseconds. `gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && time ./a.out 100` gives `145 129 541 547` and `./a.out 100 3.11s user 0.01s system 397% cpu 0.785 total`; `gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && time ./a.out 100` gives `625` and `./a.out 100 0.78s user 0.00s system 100% cpu 0.772 total` – Giannis M. Aug 14 '15 at 09:09
  • Calling srand() from the thread callback function is definitely a bug. Also, are you sure gettimeofday() is re-entrant? – Lundin Aug 14 '15 at 09:09
  • Also, time_array is not protected by a mutex, so there's a potential for all kinds of multi-thread bugs. – Lundin Aug 14 '15 at 09:11
    @Lundin `srand` is called from only one thread (the modifier thread). Is that still a bug? `gettimeofday` is thread safe according to http://stackoverflow.com/questions/3220224/is-the-gettimeofday-function-thread-safe-in-linux – Klas Lindbäck Aug 14 '15 at 10:33
  • adding threads is only useful when each thread would otherwise be blocked, for instance while waiting for user input. In general, adding threads will slow a program significantly, as there are then many more 'context switches' just between the threads, and context switches take time to perform – user3629249 Aug 15 '15 at 04:15
  • having multiple cpu cores will not help with threaded execution. To take advantage of the multiple cores, you need to use some library calls, such as those found in the openmpi libraries – user3629249 Aug 15 '15 at 04:19

2 Answers


It is rare that a multithreaded program scales perfectly with the number of threads. In your case you measured a speed-up factor of about 0.9 (665/748) with 4 threads. That is not so good.

Here are some factors to consider:

The overhead of starting threads and dividing the work. For small jobs the cost of starting additional threads can be considerably larger than the actual work. Not applicable to this case, since the overhead isn't included in the time measurements.

"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median values.

The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects caching in the L1/L2/L3 caches, and if the data is too large to fit into RAM you need to factor in disk I/O. Multithreaded implementations often touch more data at once, which can hurt cache behavior, but in rare cases the combined per-core caches can hold the whole working set and make the parallel version more than proportionally faster, a phenomenon called super-linear speedup.

Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.

Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.

Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.

The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the scan loop. Each iteration takes only a few clock cycles. Then you call gettimeofday, which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.

I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).

Klas Lindbäck
    There is no speed-up. Each thread scans a smaller part of the array, but the average delay for change detection goes up – Giannis M. Aug 14 '15 at 09:15
    Maybe you need a larger data set than 100 elements to benefit from multi-threading. What happens if you use N=10000? – Klas Lindbäck Aug 14 '15 at 09:58
  • There is a good speed-up: the total work increased but the total execution time is given by the slowest thread, so you must compare 665 with 242 as Klas did. Please don't confuse the effort with the elapsed time. Converting an application to the multi-threaded model means you can make better use of all your CPU's parallelism. I would try to increase the number of threads to find the best value. Usually, as the number of threads increases, the total execution time decreases to a minimum and then starts to increase again when the context-switch overhead becomes heavier. – cristian v Aug 14 '15 at 10:34
    @christianv Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes. (This wasn't obvious, so I've added it to the question description) – Klas Lindbäck Aug 14 '15 at 10:41
  • @KlasLindbäck you are right about the total time. Also, when I increase N a lot, multithreaded does run better (but performance still decreases from `NTHREADS=3` to `NTHREADS=4`). I still don't understand the behavior for smaller N, should I mark your answer as accepted? – Giannis M. Aug 14 '15 at 10:53
  • For smaller N, the slowdown could be caused by cpu frequency decreasing (to prevent the cpu from overheating) or by caching issues (see doron's answer). You should choose the answer you think best answers your question. – Klas Lindbäck Aug 14 '15 at 10:58
  • @KlasLindbäck From my various tests I think doron's explanation is the most likely (no cpu throttling with my current setup). You both helped me understand the program's behavior thanks a lot! – Giannis M. Aug 14 '15 at 11:05

Short Answer You are sharing memory between threads, and sharing memory between threads is slow.

Long Answer Your program uses one thread to write to my_array and a number of threads to read from my_array. Effectively my_array is shared by a number of threads.

Now let's assume you are benchmarking on a multicore machine; you are probably hoping that the OS will assign a different core to each thread.

Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance, CPUs have multi-level caches. The fastest cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20-30 cycles.

Now in lots of CPU architectures each core has its own L1 cache, but the L2 cache is shared. This means any data that is shared between threads (cores) has to go through the L2 cache, which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.

Bottom line is that if you want your multithreaded programs to perform well you need to ensure that threads do not share memory. Sharing memory is slow.

Aside Never rely on volatile to do the correct thing when sharing memory between threads; either use your library's atomic operations or use mutexes. This is because some CPUs allow out-of-order reads and writes that may do strange things if you do not know what you are doing.

doron