I'm converting a C program to a multithreaded version. The code is too long to post here, but the approach is quite simple. The original program is a pipeline of four programs, where the output of each program becomes the input of the next. I created one thread for each of the four programs, using pthreads, so that the tasks are pipelined. The machine is a 16-core server. I get the correct result, but the performance got worse. While debugging, I found the weirdest thing: even a single line of code, run on the same data, takes a different amount of time. For example, the program contains a line like the one below:
mtx[i][j][d] = max(mtx[i][j][d], mtx[i-2][j-1][d-1] + t[offset]); // max is a macro that returns the larger of two values
which is simply a three-dimensional dynamic programming computation.
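For anyone who wants to reproduce the measurement, here is a minimal sketch of how an update like this could be timed in isolation. The table dimensions (`NI`, `NJ`, `ND`), the contents of `t`, the choice `offset = d`, and the helper names `init_tables` and `time_dp_update` are all assumptions made for illustration and are not taken from the real program. It reads `CLOCK_THREAD_CPUTIME_ID`, so the result is the CPU time consumed by the calling thread rather than wall-clock time.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime and the clock IDs */
#endif
#include <assert.h>
#include <time.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical table sizes; the real program's dimensions are not shown. */
enum { NI = 64, NJ = 64, ND = 64 };

static int mtx[NI][NJ][ND];
static int t[ND];

/* Fill the inputs with fixed values so every run sees the same data. */
void init_tables(void)
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            for (int d = 0; d < ND; d++)
                mtx[i][j][d] = 0;
    for (int d = 0; d < ND; d++)
        t[d] = 1;
}

/* Apply the update to the whole table and return the CPU time, in seconds,
 * that the calling thread spent doing it. */
double time_dp_update(void)
{
    struct timespec start, end;
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    assert(rc == 0);

    for (int i = 2; i < NI; i++)
        for (int j = 1; j < NJ; j++)
            for (int d = 1; d < ND; d++) {
                int offset = d;  /* assumed indexing into t */
                mtx[i][j][d] = max(mtx[i][j][d],
                                   mtx[i-2][j-1][d-1] + t[offset]);
            }

    rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
    assert(rc == 0);
    (void)rc;  /* silences the unused warning when NDEBUG is defined */

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling `init_tables()` once and then `time_dp_update()` from a single-threaded run, and again from inside one of the pipeline threads, would give two numbers for the same work on the same data that can be compared directly.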
Because the data is identical and this line is not in any critical section, I'm really confused about what could be causing this. Could it be a caching problem, since this is a shared-memory machine?
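One way the caching hypothesis could be checked in isolation is a small false-sharing experiment: several threads each increment their own counter, with no lock and no logically shared variable, once with the counters packed next to each other and once with them spaced a cache line apart. This is only a sketch under assumptions: the 64-byte line size, the thread count, and the names `run_counters`, `packed`, and `spaced` are hypothetical and unrelated to the real program.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */
#define NTHREADS 4     /* assumed thread count, matching a four-stage pipeline */

/* A counter followed by padding, so consecutive counters sit 64 bytes apart. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static volatile long packed[NTHREADS];          /* neighbours can share a cache line */
static struct padded_counter spaced[NTHREADS];  /* no two values fall in the same line */

struct job {
    volatile long *slot;  /* the counter this thread owns */
    long iters;           /* how many increments to perform */
};

/* Thread body: increment only this thread's own counter. */
static void *bump(void *arg)
{
    struct job *j = arg;
    for (long k = 0; k < j->iters; k++)
        (*j->slot)++;
    return NULL;
}

/* Run NTHREADS threads, each incrementing a private counter iters times.
 * use_padding selects the spaced layout (nonzero) or the packed one (zero).
 * Returns the sum of all counters, which is NTHREADS * iters in both cases. */
long run_counters(int use_padding, long iters)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    long total = 0;

    for (int n = 0; n < NTHREADS; n++) {
        jobs[n].slot = use_padding ? &spaced[n].value : &packed[n];
        *jobs[n].slot = 0;  /* reset before the owning thread starts */
        jobs[n].iters = iters;
        int rc = pthread_create(&tid[n], NULL, bump, &jobs[n]);
        assert(rc == 0);
        (void)rc;
    }
    for (int n = 0; n < NTHREADS; n++) {
        pthread_join(tid[n], NULL);
        total += *jobs[n].slot;
    }
    return total;
}
```

Built with `-pthread`, timing `run_counters(0, n)` against `run_counters(1, n)` for a large `n` (for example with `clock_gettime(CLOCK_MONOTONIC, ...)` around each call) would show whether threads writing to neighbouring addresses slow each other down on this machine. Both calls compute the same result, so any difference in time comes from the memory layout alone.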