
I have some code that runs some intense matrix processing, so I thought it would be faster if I multithreaded it. However, my intention is to keep the threads alive so they can be reused for more processing in the future. Here is the problem: the multithreaded version of the code runs slower than a single thread, and I believe the problem lies with the way I signal/keep my threads alive.

I am using pthreads on Windows and C++. Here is my code for the thread, where runtest() is the function where the matrix calculations happen:

void* playQueue(void* arg)
{
    while(true)
    {
        pthread_mutex_lock(&queueLock);
        if(testQueue.empty())
            break;
        else
            testQueue.pop();
        pthread_mutex_unlock(&queueLock);
        runtest();
    }
    pthread_exit(NULL); 
}

The playQueue() function is the one passed to the pthread. What I have as of now is a queue (testQueue) of, let's say, 1000 items, and 100 threads. Each thread will continue to run until the queue is empty (hence the work inside the mutex).

I believe the reason the multithreaded version runs so slowly is something called false sharing (I think?), and that my method of signaling the threads to call runtest() and keeping them alive is poor.
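(For reference: false sharing is when threads write to different variables that happen to sit on the same cache line, so each write invalidates the other threads' copy of that line. A minimal, purely illustrative sketch of the usual mitigation, assuming C++11 alignas and a 64-byte cache line; the PaddedCounter struct is hypothetical and not part of my code:)

# define NUM_THREADS 3

// Hypothetical per-thread counters: without the alignment below, several
// counters could share one 64-byte cache line, and a write from one thread
// would force the other threads to reload that line (false sharing).
struct PaddedCounter
{
    alignas(64) long value;   // give each counter its own cache line
};

PaddedCounter perThreadCount[NUM_THREADS];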

What would be an effective way of doing this so that the multithreaded version runs faster than (or at least as fast as) the iterative version?

HERE IS THE FULL VERSION OF MY CODE (minus the matrix stuff)

# include <cstdlib>
# include <iostream>
# include <cmath>
# include <complex>
# include <string>
# include <pthread.h>
# include <queue>

using namespace std;

# include "matrix_exponential.hpp"
# include "test_matrix_exponential.hpp"
# include "c8lib.hpp"
# include "r8lib.hpp"

# define NUM_THREADS 3

int main ( );
int counter;
queue<int> testQueue;
queue<int> anotherQueue;
void *playQueue(void* arg);
void runtest();
void matrix_exponential_test01 ( );
void matrix_exponential_test02 ( );
pthread_mutex_t anotherLock;
pthread_mutex_t queueLock;
pthread_cond_t queue_cv;

int main ()

{
    counter = 0;

   /* for (int i=0;i<1; i++)
        for(int j=0; j<1000; j++)
        {
            runtest();
          cout << counter << endl;
        }*/

    pthread_t threads[NUM_THREADS];
    pthread_mutex_init(&queueLock, NULL);
    pthread_mutex_init(&anotherLock, NULL);
    pthread_cond_init (&queue_cv, NULL);
    for(int z=0; z<1000; z++)
    {
        testQueue.push(1);
    }
    for( int i=0; i < NUM_THREADS; i++ )
    {
       pthread_create(&threads[i], NULL, playQueue, (void*)NULL);
    }
    while(anotherQueue.size()<NUM_THREADS)
    {

    }
    cout << counter;
    pthread_mutex_destroy(&queueLock);
    pthread_cond_destroy(&queue_cv);
    pthread_cancel(NULL);
    cout << counter;
    return 0;
}

void* playQueue(void* arg)
{
    while(true)
    {
        cout<<counter<<endl;
        pthread_mutex_lock(&queueLock);
        if(testQueue.empty()){
                pthread_mutex_unlock(&queueLock);
            break;
        }
        else
            testQueue.pop();
        pthread_mutex_unlock(&queueLock);
        runtest();
    }
    pthread_mutex_lock(&anotherLock);
    anotherQueue.push(1);
    pthread_mutex_unlock(&anotherLock);
    pthread_exit(NULL);
}

void runtest()
{
      counter++;
      matrix_exponential_test01 ( );
      matrix_exponential_test02 ( );
}

So in here the "matrix_exponential_test" functions are taken from this website with permission and are where all of the matrix math occurs. The counter is just used for debugging, to make sure all the instances run.
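(Two details in the full listing above are worth flagging: counter++ is an unsynchronized write from several threads, and main() spins on anotherQueue instead of joining the workers. A minimal sketch of the usual fixes, assuming C++11's <atomic> is available; this is not the code I am currently running:)

# include <atomic>
# include <iostream>
# include <pthread.h>

# define NUM_THREADS 3

std::atomic<int> counter(0);     // safe to increment from any thread
void* playQueue(void* arg);      // worker function as defined above

int main()
{
    pthread_t threads[NUM_THREADS];
    // ... initialize queueLock and fill testQueue as before ...
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, playQueue, NULL);

    // block until every worker has finished, instead of spinning on anotherQueue
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    std::cout << counter.load() << std::endl;
    return 0;
}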

    100 threads? How many cores do you have? – dohashi Sep 10 '14 at 18:50
  • @dohashi 4 cores, I have a Core i7, but regardless, I have tried it with fewer threads (as low as 4), and it still runs significantly slower – G Boggs Sep 10 '14 at 18:54
  • It is going to be hard to help without knowing more about `runtest`. Is it lock-free? How fast is an average call? – dohashi Sep 10 '14 at 19:00
  • Did you *profile your code*? – nneonneo Sep 10 '14 at 19:03
  • @dohashi Runtest does not have any locks, it simply does matrix mathematics. – G Boggs Sep 10 '14 at 19:08
  • @nneonneo I have not, what would you suggest would be the best way to do so? – G Boggs Sep 10 '14 at 19:09
  • Depending on your system, you can use `Instruments` (OS X), `gcc -pg` and `gprof` (Linux), or Visual Studio's performance tools (Windows). – nneonneo Sep 10 '14 at 19:09
  • @nneonneo What should I be on the lookout for while profiling, so I know how to alter my code? Intuitively I feel like the threaded version (if done correctly) should run much faster than the single-threaded version if it's only math being done. – G Boggs Sep 10 '14 at 19:17
  • @GBoggs: just look for stuff that seems to take unexpectedly large amounts of time. Obviously the math routines should dominate. If they don't, or some small math subroutine takes longer than it should, you may find your problem. – nneonneo Sep 10 '14 at 19:18
  • "*the multithreaded version of the code runs slower*" A little slower? A lot slower? How are you comparing? Slower for the program to do the same amount of work? Or each thread runs slower? – David Schwartz Sep 10 '14 at 19:27
  • Try running 3 threads rather than a hundred. You should never run more CPU-intensive threads than you have cores –  Sep 10 '14 at 19:31
  • @Arkadiy I have, still takes longer – G Boggs Sep 10 '14 at 19:40
  • @DavidSchwartz it (multithreaded) runs approximately 4 times slower than the single threaded – G Boggs Sep 10 '14 at 19:41
  • @GBoggs So you mean it takes four times longer for the program to do the same work? Or you mean each thread in the multithreaded program runs four times slower than the single thread would run? – David Schwartz Sep 10 '14 at 19:41
  • @DavidSchwartz Sorry, yes the whole program takes 4 times as long to run the same function 1000 times. – G Boggs Sep 10 '14 at 19:42
  • How long does it take to run `runtest` in single-threaded mode? –  Sep 10 '14 at 19:43
  • @GBoggs Give us enough code to reproduce the problem. There's a good chance you're dispatching/dividing work incorrectly. – David Schwartz Sep 10 '14 at 19:44
  • @DavidSchwartz question updated with code – G Boggs Sep 10 '14 at 20:05
  • Are you timing your code using real time or CPU time? CPU time could show an increase whereas real time would show a decrease. – dohashi Sep 10 '14 at 20:15
  • @dohashi I'm looking at execution time, after the code finishes running – G Boggs Sep 10 '14 at 20:20
  • Looking at your code, `matrix_exponential_test01` and `matrix_exponential_test02` are being run multiple times in multiple threads. Where are they getting their data from? Are they accessing some global struct? Are you just re-running the same functions with the same data over and over again, in multiple threads? – dohashi Sep 10 '14 at 20:22
  • But are you measuring real (wall clock) time or CPU cycles executed? – dohashi Sep 10 '14 at 20:22
  • You need to fix the bugs you have. For example, you call `anotherQueue.size` without holding the appropriate mutex. Worse, you spin on the queue rather than using a condition variable or other appropriate form of synchronization. – David Schwartz Sep 10 '14 at 20:22
  • @dohashi I think it is creating new variables and not accessing global variables when it calls runtest(). But wouldn't that make the program memory intensive, and possibly cause it to slow down? – G Boggs Sep 10 '14 at 20:38
  • A slightly off-topic comment: take a look at the OpenMP library. It can make your life easier if you use it carefully :) – Michał Walenciak Sep 15 '14 at 08:31

1 Answer


Doesn't it get stuck?

while(true)
{
    pthread_mutex_lock(&queueLock);
    if(testQueue.empty())
        break; //<---------------- you break without unlocking the mutex...
    else
        testQueue.pop();
    pthread_mutex_unlock(&queueLock);
    runtest();
}

The section between lock and unlock runs slower than it would in a single thread.

Mutexes are slowing you down. You should lock only the critical section, and if you want to speed things up, try not to use a mutex at all.

You can do that by supplying the test via the thread's function argument rather than using the queue.
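A minimal sketch of that idea, assuming each thread is handed a fixed slice of the work up front (the WorkSlice struct, the playSlice name, and the slicing of 1000 items are illustrative, not taken from the question):

# include <pthread.h>

void runtest();                      // the matrix test from the question

// Each worker receives its own slice of the work through the pthread_create()
// argument, so no shared queue and no mutex are needed.
struct WorkSlice { int first; int count; };

void* playSlice(void* arg)
{
    WorkSlice* slice = static_cast<WorkSlice*>(arg);
    for (int i = 0; i < slice->count; i++)
        runtest();                   // the item index would be slice->first + i
    pthread_exit(NULL);
}

// In main(), something like:
//   WorkSlice slices[NUM_THREADS];
//   for (int i = 0; i < NUM_THREADS; i++) {
//       slices[i].first = i * (1000 / NUM_THREADS);
//       slices[i].count = 1000 / NUM_THREADS;
//       pthread_create(&threads[i], NULL, playSlice, &slices[i]);
//   }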

One way to avoid the mutex is to use a vector that you never delete from, together with a std::atomic_int (C++11) as the index (or to lock only the fetch-and-increment of the current index).
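A minimal sketch of the atomic-index variant, assuming C++11's <atomic> is available; `test` stands in for whatever per-item data you keep:

# include <atomic>
# include <vector>
# include <pthread.h>

void runtest();                        // the matrix test from the question

struct test { /* per-item data */ };
std::vector<test> testVector;          // filled once, before the threads start
std::atomic<size_t> nextIndex(0);      // shared cursor, no mutex required

void* playQueue(void* arg)
{
    while (true)
    {
        // fetch_add returns the previous value and increments atomically,
        // so each index is claimed by exactly one thread
        size_t i = nextIndex.fetch_add(1);
        if (i >= testVector.size())
            break;                     // every item has been claimed
        runtest();                     // would operate on testVector[i]
    }
    pthread_exit(NULL);
}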

Or use an iterator, like this:

vector<test> testVector;
vector<test>::iterator it;
// it is initialized once, after testVector is filled, to:
it = testVector.begin();

Now your loop can look like this:

while(true)
{
    vector<test>::iterator it1;
    pthread_mutex_lock(&queueLock);
    it1 = (it==testVector.end())? it : it++;
    pthread_mutex_unlock(&queueLock);

    //now you are outside the critical section;
    //check the local copy it1, not the shared it:
    if(it1==testVector.end())
        break;
    //you don't delete or change the vector,
    //so you can use the it1 iterator freely
    runtest();
}
  • Well no, because at that point the queue is empty so the threads are finished running and the program will end (it ends so I can see the runtime). – G Boggs Sep 10 '14 at 19:05
  • @GBoggs: If the queue is empty, the threads will hang after the first one exits. The only reason your program doesn't get stuck is probably because you aren't bothering to wait for them to exit. – nneonneo Sep 10 '14 at 19:10
  • @nneonneo You're right, I just fixed it, and now I've made sure they all close. But that still doesn't fix the main issue. – G Boggs Sep 10 '14 at 19:15
  • ...for which you'd have to profile to find the bottleneck. Figure out how to do that and you'll be most of the way there. – nneonneo Sep 10 '14 at 19:16
  • @GBoggs see my addition, maybe it would help. – SHR Sep 10 '14 at 19:45
  • @SHR I liked the idea, but unfortunately I tested it and it still runs as slow as before. – G Boggs Sep 10 '14 at 19:57
  • @GBogg Multi-threading does not always improve performance. If your test runs at 100% CPU, without waits, in a single thread, it usually won't be improved. – SHR Sep 10 '14 at 20:09
  • @SHR Well it's just confusing, because when I ran the program a while ago with 100 threads and 100 instances vs. 100 iterations, the threaded version was about twice as fast. It only slowed down when I tried to keep the threads alive and run the function more times. – G Boggs Sep 10 '14 at 20:15
  • @GBogg If you really want to improve performance, I can suggest you use a message queue (http://linux.die.net/man/7/mq_overview). Each thread can handle its own instance of the queue and no other synchronization is needed. It also works between processes. – SHR Sep 10 '14 at 20:25
  • @SHR Is this available on Windows? I'm not running on Linux. – G Boggs Sep 10 '14 at 20:28
  • @GBoggs I don't think so, but maybe there are others. – SHR Sep 10 '14 at 20:37