
I'm converting a C program to a multithreaded version. The code is too long to post here, but the approach is quite simple. The original program contains a pipeline of four programs, where the output of each program becomes the input of the next. I created one thread for each of the four programs to build a task pipeline using pthreads. The machine is a 16-core server. I'm getting the correct result, but the performance is getting worse. While debugging, I found the weirdest thing: even running just one line of code with the same data, the timing is different. For example, there is one line of code in the program like the one below

mtx[i][j][d] = max(mtx[i][j][d], mtx[i-2][j-1][d-1] + t[offset]);//max is a macro defined to find the max of two values

which is simply a three-dimensional dynamic programming computation.

Because all the data is the same and this line is not in any critical section, I was really confused about what could be the cause. Could it be a caching problem, since this is a shared-memory machine?

Jimmy
  • It's impossible to say without seeing your code. For example, we don't know how you have parallelised your pipeline. Obviously there are dependencies across the pipeline stages, so if you have not coded it correctly you may just be serialising the computation. Also, there are general reasons why just throwing more threads at a problem does not necessarily improve performance. The same question has been asked many times - see for example: [Why is the multithreaded version of this program slower?](https://stackoverflow.com/questions/32005767/why-is-the-multithreaded-version-of-this-program-slower) – kaylum Feb 17 '16 at 05:56
  • @kaylum I agree that without posting the code, it's impossible to know whether the entire pipelined version is correctly coded. But are there more specific reasons that could cause the problem I mentioned? With only one line of code, why did I get different running times? It was just killing me. – Jimmy Feb 17 '16 at 06:04
  • @kaylum So basically, from the post you shared, sharing memory between threads can make the program run slowly. I think I can start with that. Thanks. – Jimmy Feb 17 '16 at 06:08
  • There are other factors. For example, if each thread just executes one line of code on a small data set, then a non-threaded program which executes all those same lines will almost certainly perform better than the multithreaded program, because there is overhead in creating and synchronizing the threads. – kaylum Feb 17 '16 at 06:12
  • @kaylum I understood the overhead of creating and synchronizing the threads. I wasn't using threads to just execute one line of code and then synchronize. I'll check my code again and see if I can find any other problems. Thanks for the suggestions. – Jimmy Feb 17 '16 at 06:17
  • Also have a look at what scheduling policy the OS uses for multiple threads. If each thread has the same priority, for example, then they might be executing one after the other, which could slow down your program. – jigglypuff Feb 17 '16 at 06:17
  • ^ This. Allocating threads is pretty expensive - a property that is pretty universal across most systems. If you're doing lots of tiny calculations in individual threads, usually the overhead is close to, or in some cases greater than, the actual work being done. How many threads are you spawning? You should only have one, maybe two per CPU core. Are you pooling threads and sleeping/waking them as you're going through your iterations? If you're allocating new threads each time you start a calculation, I would suspect that is your problem. – Qix - MONICA WAS MISTREATED Feb 17 '16 at 06:18
  • @Qix I created four threads including the master thread with each one running one function logically. I created four work queues to pass the data between threads and used semaphores to sleep/wake them. I don't think the creation of threads is the problem but maybe the communications will be. – Jimmy Feb 17 '16 at 06:29
  • @Martinn The policy is TS. From this link - https://www-01.ibm.com/support/knowledgecenter/SSSTCZ_2.0.0/com.ibm.rt.doc.20/realtime/lnx_schedule.html, it says with TS, "each thread runs for a limited time period, after which the next thread is allowed to run." Could TS slow down the program? – Jimmy Feb 17 '16 at 06:31
  • Does each program produce its output all at once? And does each program require its full input before it can begin processing? If the answer to either of these questions is "yes", multithreading probably won't help you. – David Schwartz Feb 17 '16 at 07:31
  • @Jimmy This may affect the performance and will largely depend on what the threads are doing and whether or not they are killed as soon as they finish their work. Have a go at experimenting with other scheduling policies, just remember to set the priority when using FIFO and RR in the link you gave. – jigglypuff Feb 18 '16 at 02:14
  • @DavidSchwartz Yes, each program needs to wait for the previous one to complete. Why won't multithreading improve the performance in this case? Each program can still run in parallel, just like CPU pipelining, so I don't think it would slow the program down unless the data is too small relative to the thread creation and join overhead. Let me know if I am wrong, thanks. – Jimmy Feb 20 '16 at 12:10
  • @Martinn I will look into that to see if it can help, thanks. – Jimmy Feb 20 '16 at 12:10
  • @Jimmy If each program waits for the previous one to complete, how will they run in parallel? – David Schwartz Feb 21 '16 at 02:19
  • @DavidSchwartz When each program finishes, it triggers the next one, which starts to run concurrently once its input from the previous one is ready. Just like instruction pipelining in a CPU - https://en.wikipedia.org/wiki/Instruction_pipelining. If there are N stages, theoretically it can reduce the execution time to approximately 1/N, although in practice it is always somewhat worse. This is my understanding of parallelism with pipelining. – Jimmy Feb 21 '16 at 05:48
  • @Jimmy That makes no sense. If each one cannot start until the previous one finishes, no overlap is possible. – David Schwartz Feb 21 '16 at 07:02
  • @DavidSchwartz Well, the first program doesn't need to wait, so as soon as it finishes running on the previous data, it can start on new data. So it's possible for the first program to be executing on the new data in parallel with the second program running on the previous data. This is how the classic RISC pipeline improves CPU instruction throughput: Instruction Fetch, Decode, Execute, Memory Access and Write Back can all run in parallel within one clock. – Jimmy Feb 21 '16 at 14:53

0 Answers