Little performance increasing when using multiple threads

Question

I was implementing a multithreaded Gauss-Jordan method for solving a linear system, and I saw that running on two threads took only about 15% less time than running on a single thread, instead of the ideal 50%. So I wrote a simple program that reproduces this. Here I create a 2000x2000 matrix and give 2000/THREADS_NUM rows to each thread to do some calculations with them.

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <time.h>

#ifndef THREADS_NUM
#define THREADS_NUM 1
#endif

#define MATRIX_SIZE 2000


typedef struct {
    double *a;
    int row_length;
    int rows_number;
} TWorkerParams;

void *worker_thread(void *params_v)
{
    TWorkerParams *params = (TWorkerParams *)params_v;
    int row_length = params->row_length;
    int i, j, k;
    int rows_number = params->rows_number;
    double *a = params->a;

    for(i = 0; i < row_length; ++i) // row_length is always the same
    {
        for(j = 0; j < rows_number; ++j) // rows_number is inversely proportional
                                         // to the number of threads
        {
            for(k = i; k < row_length; ++k) // row_length is always the same
            {
                a[j*row_length + k] -= 2.;
            }
        }
    }
    return NULL;
}


int main(int argc, char *argv[])
{
    // The matrix is of size NxN
    double *a =
        (double *)malloc(MATRIX_SIZE * MATRIX_SIZE * sizeof(double));
    TWorkerParams *params =
        (TWorkerParams *)malloc(THREADS_NUM * sizeof(TWorkerParams));
    pthread_t *workers = (pthread_t *)malloc(THREADS_NUM * sizeof(pthread_t));
    struct timespec start_time, end_time;
    int rows_per_worker = MATRIX_SIZE / THREADS_NUM;
    int i;
    if(!a || !params || !workers)
    {
        fprintf(stderr, "Error allocating memory\n");
        return 1;
    }
    for(i = 0; i < MATRIX_SIZE*MATRIX_SIZE; ++i)
        a[i] = 4. * i; // just an example matrix
    // Initialization of the matrix is done; now initialize the threads' params
    for(i = 0; i < THREADS_NUM; ++i)
    {
        params[i].a = a + i * rows_per_worker * MATRIX_SIZE;
        params[i].row_length = MATRIX_SIZE;
        params[i].rows_number = rows_per_worker;
    }
    // Get start time
    clock_gettime(CLOCK_MONOTONIC, &start_time);
    // Create threads
    for(i = 0; i < THREADS_NUM; ++i)
    {
        if(pthread_create(workers + i, NULL, worker_thread, params + i))
        {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }
    // Join threads
    for(i = 0; i < THREADS_NUM; ++i)
    {
        if(pthread_join(workers[i], NULL))
        {
            fprintf(stderr, "Error joining thread\n");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end_time);
    printf("Duration: %lf msec.\n", (end_time.tv_sec - start_time.tv_sec)*1e3 +
            (end_time.tv_nsec - start_time.tv_nsec)*1e-6);
    return 0;
}

Here is how I compile it:

gcc threads_test.c -o threads_test1 -lrt -pthread -DTHREADS_NUM=1 -Wall -Werror -Ofast
gcc threads_test.c -o threads_test2 -lrt -pthread -DTHREADS_NUM=2 -Wall -Werror -Ofast

Now when I run I get:

./threads_test1
Duration: 3695.359552 msec.
./threads_test2
Duration: 3211.236612 msec.

This means the 2-thread program takes about 13% less time than the single-threaded one, even though there is no synchronization between the threads and they don't share any memory. I found this answer: https://stackoverflow.com/a/14812411/5647501 and thought that there might be some issue with the processor cache, so I added padding, but the result remained the same. I changed my code as follows:

typedef struct {
    double *a;
    int row_length;
    int rows_number;
    volatile char padding[64 - 2*sizeof(int) - sizeof(double *)];
} TWorkerParams;

#define VAR_SIZE (sizeof(int)*5 + sizeof(double)*2)
#define MEM_SIZE ((VAR_SIZE / 64 + 1) * 64  )
void *worker_thread(void *params_v)
{
    TWorkerParams *params = (TWorkerParams *)params_v;
    volatile char memory[MEM_SIZE];
    int *row_length  =      (int *)(memory + 0);
    int *i           =      (int *)(memory + sizeof(int)*1);
    int *j           =      (int *)(memory + sizeof(int)*2);
    int *k           =      (int *)(memory + sizeof(int)*3);
    int *rows_number =      (int *)(memory + sizeof(int)*4);
    double **a        = (double **)(memory + sizeof(int)*5);

    *row_length = params->row_length;
    *rows_number = params->rows_number;
    *a = params->a;

    for(*i = 0; *i < *row_length; ++*i) // row_length is always the same
    {
        for(*j = 0; *j < *rows_number; ++*j) // rows_number is inversely proportional
                                         // to the number of threads
        {
            for(*k = 0; *k < *row_length; ++*k) // row_length is always the same
            {
                (*a + *j * *row_length)[*k] -= 2. * *k;
            }
        }
    }
    return NULL;
}

So my question is: why do I get only a 15% speed-up instead of 50% when using two threads here? Any help or suggestions will be appreciated. I am running 64-bit Ubuntu Linux, kernel 3.19.0-39-generic, on an Intel Core i5 4200M CPU (two physical cores with hyper-threading), but I also tested it on two other machines with the same result.

EDIT: If I replace a[j*row_length + k] -= 2.; with a[0] -= 2.;, I get the expected speed-up:

./threads_test1
Duration: 1823.689481 msec.
./threads_test2
Duration: 949.745232 msec.

EDIT 2: Now, when I replace it with a[k] -= 2.; I get the following:

./threads_test1
Duration: 1039.666979 msec.
./threads_test2
Duration: 1323.460080 msec.

This one I can't get at all.

I'm voting to close this question as off-topic because this sounds more like a question for code-review. – too honest for this site, Dec 07 '15 at 15:49

EmDroid · Answer 1 · 2015-12-07T18:04:31.430

7

This is a classic issue: swap the i and j for loops.

You are iterating through columns first and processing the rows in the inner loop, which causes many more cache misses than necessary.

My results with the original code (the first version without padding):

$ ./matrix_test1
Duration: 4620.799763 msec.
$ ./matrix_test2
Duration: 2800.486895 msec.

(better improvement than yours actually)

After switching the for loops for i and j:

$ ./matrix_test1
Duration: 1450.037651 msec.
$ ./matrix_test2
Duration: 728.690853 msec.

Here is the 2x speedup.
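For reference, a minimal sketch of what the interchange looks like when applied to the worker from the question. The function name process_rows_interchanged is hypothetical, and the body is only an assumption about how the swap was done (the j loop moved outside the i loop, index variables unchanged):

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as the question's worker, with the j loop (rows) moved
 * outside the i loop. Each row is finished before the next one is touched,
 * so it can stay in cache across all i passes. Hypothetical helper. */
void process_rows_interchanged(double *a, int row_length, int rows_number)
{
    assert(a != NULL && row_length >= 0 && rows_number >= 0);
    for (int j = 0; j < rows_number; ++j) {
        double *row = a + (size_t)j * row_length;
        for (int i = 0; i < row_length; ++i) {
            for (int k = i; k < row_length; ++k) {
                row[k] -= 2.;
            }
        }
    }
}
```

Every element receives exactly the same subtractions as before (column k is decremented k+1 times), so only the memory access order changes.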

EDIT: In fact, the original is not that bad, because the k index still walks along a row across its columns, but it is still much better to iterate over the rows in the outer loop. And as i rises, you process fewer and fewer items in the innermost loop, so it still matters.
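A rough back-of-the-envelope sketch of the difference, with hypothetical helper names: with i outermost, every pass of i sweeps over all rows owned by the thread, whereas with rows outermost only one row is live at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one pass of the i loop in the original loop order:
 * every row of the thread's slice contributes (row_length - i) doubles. */
size_t bytes_per_i_pass(int rows_number, int row_length, int i)
{
    assert(rows_number >= 0 && 0 <= i && i <= row_length);
    return (size_t)rows_number * (size_t)(row_length - i) * sizeof(double);
}

/* Bytes touched while one row is processed when rows are the outer loop. */
size_t bytes_per_row(int row_length)
{
    assert(row_length >= 0);
    return (size_t)row_length * sizeof(double);
}
```

Assuming 8-byte doubles, the first pass over the question's 2000x2000 matrix touches 32,000,000 bytes with one thread and 16,000,000 bytes per thread with two, while a single row is only 16,000 bytes. The former is typically much larger than a CPU's caches, which may be why the original order ends up limited by memory traffic rather than by the number of cores.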

EDIT2: (I removed the block solution because it was actually producing different results.) It should still be possible to use blocks to improve cache performance.
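One way blocking could be done without changing the results is sketched below. This is not the removed solution; the function name and the block_rows parameter are hypothetical. The idea is to keep the question's i-then-j order but restrict it to a small band of rows at a time, so the band can stay in cache across all i passes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical blocked variant of the question's worker: the i loop only
 * sweeps over a band of `block_rows` rows at a time. Each element still
 * receives the same subtractions, so the results are unchanged. */
void process_rows_blocked(double *a, int row_length, int rows_number,
                          int block_rows)
{
    assert(a != NULL && row_length >= 0 && rows_number >= 0 && block_rows > 0);
    for (int j0 = 0; j0 < rows_number; j0 += block_rows) {
        int j1 = j0 + block_rows;
        if (j1 > rows_number)
            j1 = rows_number;           /* last band may be shorter */
        for (int i = 0; i < row_length; ++i) {
            for (int j = j0; j < j1; ++j) {
                double *row = a + (size_t)j * row_length;
                for (int k = i; k < row_length; ++k)
                    row[k] -= 2.;
            }
        }
    }
}
```

With block_rows set to 1 this reduces to the swapped-loop version, and with block_rows equal to rows_number it is the original order. A good value in between would have to be found by measurement; no timings were taken for this sketch.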

edited Dec 07 '15 at 18:04

answered Dec 07 '15 at 16:33

EmDroid

Could it be the difference between our machines? After switching the loops for i and j I get the following: ./threads_test1 Duration: 1048.321083 msec. ./threads_test2 Duration: 1012.153498 msec. – Matvey Dec 07 '15 at 16:43
Did you really just switch the loops like this: for(j = 0; j < rows_number; ++j) { for(i = 0; i < row_length; ++i) { or did you also exchange the index variables i and j? Maybe you forgot to exchange the index variables in the innermost loop as well? – EmDroid Dec 07 '15 at 16:49
Try taking the first code in your question and just moving the "for (j ..." statement above the "for (i ..." statement; do not exchange the variables yet. – EmDroid Dec 07 '15 at 16:59
Yes, I really just switched the loops like you said, swapping the two lines "for(j...)" and "for(i...)". What do you mean by exchanging the index variables? I don't think I can exchange them, as their meaning remains the same after I switch the loops; moreover, isn't the point of switching the loops to leave the indexes as they are? – Matvey Dec 07 '15 at 17:08
Hm, then it is really strange, because your 1-thread and 2-thread times are now very close (much better than the original times, but the 1 vs 2 threads ratio is now much worse). Can you try on another machine as well? – EmDroid Dec 07 '15 at 17:17
I don't have one right now; I'll try later and write the results here. – Matvey Dec 07 '15 at 17:19
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/97220/discussion-between-matvey-and-e-maskovsky). – Matvey Dec 07 '15 at 18:23

score 1 · Answer 2 · answered Dec 07 '15 at 15:54

1

You mention a 13% speedup, but how much of the elapsed time is spent in your calculation function rather than in the rest of the program?

You could start by measuring only the time spent in the calculation routine, excluding the time spent on thread management. It's possible that you lose a significant part of your time in thread management, which could explain the small speedup you obtained.

Besides, a 50% speedup with 2 threads is nearly impossible to obtain.
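One way to check this, sketched under the assumption that the cost of interest is just pthread_create plus pthread_join, is to time threads that do no work at all. The names empty_worker and thread_overhead_msec are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Worker that does nothing, so only the create/join cost is measured. */
static void *empty_worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Returns the wall-clock time in milliseconds needed to create and join
 * `nthreads` idle threads, or -1.0 if a pthread call fails. */
double thread_overhead_msec(int nthreads)
{
    enum { MAX_THREADS = 64 };
    pthread_t workers[MAX_THREADS];
    struct timespec start, end;
    int created = 0, failed = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < nthreads; ++i) {
        if (pthread_create(&workers[i], NULL, empty_worker, NULL)) {
            failed = 1;
            break;
        }
        ++created;
    }
    /* Join whatever was created, even after a failure. */
    for (int i = 0; i < created; ++i)
        if (pthread_join(workers[i], NULL))
            failed = 1;
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (failed)
        return -1.0;
    return (end.tv_sec - start.tv_sec) * 1e3 +
           (end.tv_nsec - start.tv_nsec) * 1e-6;
}
```

Comparing thread_overhead_msec(2) with the multi-second durations reported in the question would show whether thread management is a plausible explanation; the sketch was not run on the asker's machine.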
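Note that the "50%" in the question refers to run time, which corresponds to a 2x speedup factor. A minimal sketch of the two measures, with hypothetical helper names:

```c
#include <assert.h>

/* Speedup factor: how many times faster the parallel run is. */
double speedup(double t_serial_msec, double t_parallel_msec)
{
    assert(t_serial_msec > 0.0 && t_parallel_msec > 0.0);
    return t_serial_msec / t_parallel_msec;
}

/* Percentage of wall-clock time saved relative to the serial run. */
double time_saved_percent(double t_serial_msec, double t_parallel_msec)
{
    assert(t_serial_msec > 0.0 && t_parallel_msec > 0.0);
    return 100.0 * (t_serial_msec - t_parallel_msec) / t_serial_msec;
}
```

For the question's timings (3695.359552 and 3211.236612 msec) this gives a speedup of about 1.15x, or about 13.1% less time. For the swapped-loop timings reported in the other answer (1450.037651 and 728.690853 msec) it gives about 1.99x, or about 49.7% less time. The ideal for two threads is 2x, i.e. 50% less time.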

marcS

Thank you for your reply. I tried increasing MATRIX_SIZE to 3000, and I still get 24 and 21 seconds. I don't think that managing threads here (creating 2 threads and joining them) would take so much time. – Matvey Dec 07 '15 at 16:00
