
I want to compute the sum of a big matrix, and I'm currently seeing no performance improvement between using multiple threads and a single one. I think the problem is related to false sharing, but I have already added padding to my struct. Please have a look!

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>

#define WIDTH 20000 
pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;

struct split { // sizeof(split) = 24 
    int start; 
    int end; 
    int* matrix; 
    int i; 
    char padding[64 - 24]; // Padding the private sum variables forces them into separate cache lines and removes false sharing. Assume the cache line is 64 bytes
};

int ran(){ 
    return rand() % 21; 
}
int* createBigMatrix(){
    int* a = malloc(sizeof(int)* WIDTH * WIDTH);
    for (int i = 0; i < WIDTH * WIDTH; i ++){ 
        a[i] = ran(); // fill up the matrix with random numbers
    }
    return a;
}
static int finalSum;
void* partialSum(void* arg){ 
    struct split* a = arg;
    int totalSum = 0; // create local variable
    int i;
    for (i = a->start; i <= a->end; i ++){  
        totalSum += a->matrix[i];
    }
    pthread_mutex_lock(&mylock);
    finalSum += totalSum; // critical section
    pthread_mutex_unlock(&mylock);  
    free(a);

    return 0;
} 
int main(){ //-294925289
    int useMultiThreads = 1; // there is no difference between using one thread or 4 threads
    finalSum = 0;
    pthread_t thread_ids[4];  
    // I want a square matrix of WIDTH x WIDTH
    int* c = createBigMatrix();  

    printf("%lu\n", sizeof(struct split));
    if (useMultiThreads){
        // split the tasks evenly among 4 threads
        // since the matrix is 20,000 x 20,000, there are 400,000,000 cells
        int start[] = {0, 100000000, 200000000, 300000000};
        int end[] = {99999999, 199999999, 299999999, 399999999}; 
        // calculate sum
        for (int i = 0; i < 4; i ++){
            struct split* a = malloc(sizeof(struct split));
            a->start = start[i];
            a->end = end[i];
            a->matrix = c;
            pthread_create(thread_ids + i, NULL, partialSum, a);
        }

        for (int i = 0; i < 4; i ++){ // join em up
            pthread_join(thread_ids[i], NULL);
        }
    }
    else { // use single thread
        for (int i = 0; i <= 399999999; i ++){
            finalSum += c[i];
        }
    }

    printf("total sum is %d\n", finalSum);
/*
    real    0m4.871s
    user    0m4.844s
    sys     0m0.392s
*/ 
    free(c);
    return 0;
}
fatffatable
    There does not seem to be much scope for false sharing since the matrix indices used by the threads do not overlap and, anyway, padding the parameter struct would not help. How are you measuring the time taken for the sum? It would seem to me that the overall performance of this process is dominated by creating and loading the huge array before the summing starts at all. – Martin James May 29 '16 at 06:36
    Be careful with your indices, `int` is definitely not the correct type for large matrices like this. Also factor the use of `a->` out of your `for` loop. The compiler can't know whether `*a` may change under the hood, so it has to reload it on each iteration. You could make `a` `restrict` qualified, but it would be simpler to just load the values (bounds and matrix) into local variables and use those inside the loop. – Jens Gustedt May 29 '16 at 07:53
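
A minimal sketch of the change suggested in the comment above (loading the bounds and the matrix pointer into locals, and using a wider type for the index and accumulator), assuming the rest of the program stays as posted; it is meant as an illustrative drop-in replacement for partialSum, not the asker's original code:

void* partialSum(void* arg){
    struct split* a = arg;
    // load everything from *a into locals once, so the compiler
    // does not have to re-read through the pointer on every iteration
    long start = a->start;
    long end = a->end;
    const int* matrix = a->matrix;
    long totalSum = 0; // wider accumulator, also avoids overflowing the partial sum
    for (long i = start; i <= end; i ++){
        totalSum += matrix[i];
    }
    pthread_mutex_lock(&mylock);
    finalSum += totalSum; // critical section, still an int total as in the original
    pthread_mutex_unlock(&mylock);
    free(a);
    return 0;
}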

1 Answer


I don't see what the padding of your struct should have to do with the performance of your code. The real data is in the matrix that it points to.
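
For contrast, here is a hypothetical layout (not taken from the question) where padding to a cache line actually matters: per-thread results stored next to each other in shared memory. In the posted code each thread accumulates into a stack-local totalSum, so there is nothing to separate.

// Hypothetical example only: without the padding, adjacent per-thread
// counters would share a 64-byte cache line, and concurrent writes from
// different cores would keep invalidating that line (false sharing).
struct padded_sum {
    long value;
    char padding[64 - sizeof(long)];
};
struct padded_sum partial[4]; // each thread writes only partial[thread_id].value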

As to your concern, the lack of speedup: this is probably due to the fact that your code is completely memory bound. That is, to perform the sum, the data must be fetched from memory through the memory bus, and your matrix is far too large to fit in cache. So your computation is bound by the bandwidth of your memory bus, which is shared by all your cores.
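
A rough back-of-envelope check, assuming a purely hypothetical ~13 GB/s of sustained memory bandwidth (the real figure depends on the machine):

/* back-of-envelope only, illustrative numbers */
double bytes = 400000000.0 * sizeof(int);   /* 1.6e9 bytes of matrix data */
double assumed_bandwidth = 13e9;            /* hypothetical bytes per second */
printf("lower bound on sum time: %.2f s\n", bytes / assumed_bandwidth); /* about 0.12 s */

That lower bound holds whether one core or four cores are doing the additions, since they all share the same bus.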

Also notice that your code is not dominated by doing the sum, but by the calls to ran(), which are in the sequential part of the program.
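
One way to check this is to time the fill and the summation separately. A minimal sketch using POSIX clock_gettime; the helper name and the placement in main() are illustrative and not part of the original code:

#include <time.h>

static double seconds_now(void){ // monotonic wall-clock time in seconds
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* in main(): */
double t0 = seconds_now();
int* c = createBigMatrix();   // sequential fill, one rand() call per cell
double t1 = seconds_now();
/* ... create and join the four threads ... */
double t2 = seconds_now();
printf("fill: %.3f s  sum: %.3f s\n", t1 - t0, t2 - t1);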

Jens Gustedt