
I've written a matrix-multiplication program designed to quantify how much faster threading is for each matrix size. The code that creates and runs the threads is below. I'm new to threading, but whenever I use it the program takes about 12 times longer. Am I doing something wrong, or does anyone know why it's so much slower?

Thanks,
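(The `struct vectorV` used below isn't shown in the question; presumably it carries the per-element arguments, roughly like this sketch:)

```c
/* Presumed layout of the argument struct (not shown in the question):
   one output element's coordinates plus pointers to the three matrices. */
struct vectorV {
    int n;         /* matrix dimension */
    int i, ii;     /* row and column of the output element */
    int **first;   /* left operand */
    int **second;  /* right operand */
    int **out;     /* result matrix */
};
```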

void *vectorMultiply(void *arguments){
    struct vectorV *args = arguments;

    /* One element of the result: dot product of row i and column ii. */
    int sum = 0;
    for (int iii = 0; iii < args->n; iii++) {
        sum = sum + args->first[args->i][iii] * args->second[iii][args->ii];
    }
    args->out[args->i][args->ii] = sum;
    return NULL;  /* a pthread start routine must return a void * */
}

void singleThreadVectorMultiply(int n, int i, int ii,
                                int **first, int **second, int **out){
    int sum = 0;
    for (int iii = 0; iii < n; iii++) {
        sum = sum + first[i][iii] * second[iii][ii];
    }
    out[i][ii] = sum;
}

void multiplyMatrix(int n, int** first, int** second, int** out){
    pthread_t tid[n][n]; 
    struct vectorV values[n][n];
    for (int i = 0; i < n; i++) {
        for (int ii = 0; ii < n; ii++) {
            if(!SINGLETHREAD){
                values[i][ii].n=n;
                values[i][ii].i=i;
                values[i][ii].ii=ii;
                values[i][ii].first = first;
                values[i][ii].second=second;
                values[i][ii].out=out;
                pthread_create(&tid[i][ii], NULL, vectorMultiply, 
                              (void *)&values[i][ii]); 
            }
            else{
                clock_t time; 
                time = clock(); 
                singleThreadVectorMultiply(n,i,ii,first,second,out);
                time = clock() - time; 
                totalTime+=time;
            }

        }
    }
    if(!SINGLETHREAD){
        clock_t time; 
        time = clock(); 
        for(int i=0; i < n; i++)
            for(int ii=0; ii < n; ii++)
                pthread_join( tid[i][ii], NULL); 
        time = clock() - time; 
        totalTime+=time;
    }
}
  • The thread design doesn't seem to make any sense. If `n` is small, your threads do almost no work, so you're adding pure overhead. If `n` is large, your threads will be fighting for the CPU – ikegami Dec 09 '19 at 11:57
  • You might need to take the pthread creation in account too. In case of strange benchmarking results: question the benchmarking. Then question it again. – Lundin Dec 09 '19 at 11:59
  • 4
    In effect, you are creating `n*n` threads here, each doing a single vector multiplication. This makes no sense. You have to create `c` threads once, with c being the number of CPU cores and distribute the work as equally as possible to these c threads. Then, merge the results in the main thread – Ctx Dec 09 '19 at 12:04
  • 2
    As have been pointed out, the design does not make sense. Parallelizing is not a silver bullet that automagically makes things faster. In the case of a matrix multiplication, the biggest issue is typically to make sure that the relevant data is in the cache. If you cannot solve that issue, then there's not point in parallelizing at all. For large matrices, it's not the cpu operations that takes time. It's fetching the data from memory. You can easily do 100 multiplications in the time it takes to fetch data from ram. – klutt Dec 09 '19 at 12:13
  • I completely understand this code isn't clever in practice, but it's being used to demonstrate exactly that. I'm trying to show the "Big Oh" analysis style where creating n^2 threads would take the system to an overall n instead of n^3. Obviously this doesn't work without n^2 cores, but I'm only timing the compute time, not the overheads. Surely it shouldn't take 12 times as long? – hamish sams Dec 09 '19 at 12:17
  • @hamishsams That depends on the choice of `n`. I am sure, you will find a (higher) value for `n`, where the threaded version is 100 times slower than the single threaded. – Ctx Dec 09 '19 at 12:22
  • @hamishsams I'm not surprised at all that it takes 12 times longer. That's not strange at all. Fetching data from main memory typically takes 200 clock cycles, and your code seems to be jumping back and forth all the time. – klutt Dec 09 '19 at 12:23
  • @hamishsams I once wrote some code that created a matrix according to some specs. I realized that the matrix was symmetric across the diagonal, so I used it to reduce the amount of calculations by 50%. For big matrices it was much, much slower. The reason? Cache! – klutt Dec 09 '19 at 12:25
  • How are you “timing the compute time, not the overheads”? How do you have any sort of measurement that a load instruction spent only a little time in an active state in the processor and a lot of time waiting for memory to respond? – Eric Postpischil Dec 09 '19 at 12:51
  • *each of those 25 – ikegami Dec 09 '19 at 17:39

0 Answers