
I am trying to parallelize some code using OpenMP. The serial time for my current input size is around 9 seconds, and the code has the following form:

int main()
{
    /* do some stuff*/ 
    myfunction();
}

void myfunction()
{
    for (int i=0; i<n; i++)
    {
        //it has some parameters but that is beyond the point I guess
        int rand = custom_random_generator();
        compute(rand);
    }
}

Here the random generator can be executed in parallel since there are no dependencies, and the same goes for the compute function, so I attempted to parallelize this piece, but all my attempts failed. My first thought was to put these functions in a task so they get executed in parallel, but that resulted in a slower run. Here is what I did:

void myfunction()
{
    for (int i=0; i<n; i++)
    {
        #pragma omp task 
        {
            //it has some parameters but that is beyond the point I guess
            int rand=custom_random_generator();
            compute(rand);
        }
    }
}

Result: 23 seconds, more than double the serial time

Putting the task on compute() only gave the same result.

Even worse attempt:

void myfunction()
{
    #pragma omp parallel for
    for (int i=0; i<n; i++)
    {
        //it has some parameters but that is beyond the point I guess
        int rand=custom_random_generator();
        compute(rand);
    }
}

Result: 45 seconds

Theoretically speaking, why could this happen? I know that to pinpoint my exact problem anyone would need a minimal reproducible example, but my goal with this question is to understand the different theories that could explain it and apply them myself. Why would parallelizing an "embarrassingly parallel" piece of code result in far worse performance?

  • Something else is causing all this overhead, it cannot only be thread creation; I would suspect false sharing, cache invalidation and so on. How many cores do you have? – dreamcrash May 01 '21 at 09:52
  • I have a MacBook Air 2019 with an Intel Core i5, so I have 2 cores but can run 4 threads in parallel since it supports hyperthreading @dreamcrash – Sergio May 01 '21 at 09:54
  • Can you tell me the times with #pragma omp parallel for num_threads(1), then num_threads(2) ? – dreamcrash May 01 '21 at 09:55
  • @dreamcrash In which configuration? I mentioned 2 solutions; I am not sure if you mean "#pragma omp parallel" for num_threads(1) or "#pragma omp parallel for" with num_threads(1) – Sergio May 01 '21 at 09:58
  • The second one with #pragma omp parallel for – dreamcrash May 01 '21 at 10:00
  • I got 9 seconds for 1 thread and 28 seconds for 2, note that it is 45 seconds for 4 threads – Sergio May 01 '21 at 10:10
  • Something is wrong in your code, and with the current example it is impossible to tell – dreamcrash May 01 '21 at 10:45
  • No problem, look for race conditions and shared state; this includes external function calls – dreamcrash May 01 '21 at 10:53
  • `custom_random_generator()` does not have any parameter and is supposed to return a random value. Thus, this means there is an *implicit state* (eg. a global variable) and this state is likely *shared* between your threads. You could make your state thread-local to avoid any race conditions. Can you tell us more about that or at least clarify this point? (Note that race conditions are not only a performance issue here, because the result may not actually be random due to them.) – Jérôme Richard May 01 '21 at 20:38
  • Your task code has no parallelism! (There is no parallel directive anywhere in the code; a minimal sketch of the missing structure follows these comments.) You may also want to look at papers on parallel random number generation, it is not a trivial task. (E.g. Parallel Random Numbers: As Easy as 1, 2, 3 - The Salmons http://www.thesalmons.org/john/random123/papers/random123sc11.pdf ) – Jim Cownie May 03 '21 at 08:28
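
Picking up the last two comments, here is a minimal sketch of the structure they describe: an enclosing parallel region so the tasks are actually executed by a team of threads, plus generator state that is thread-local. The generator body below is only a placeholder (a trivial LCG), since the real custom_random_generator() was not shown; n and compute() are the question's own symbols.

#include <omp.h>

extern int n;              // from the question's code
void compute(int value);   // from the question's code

// Hypothetical thread-local generator state and generator body; the real
// custom_random_generator() was not shown, this is just a placeholder.
static unsigned int rng_state;
#pragma omp threadprivate(rng_state)

static int custom_random_generator(void)
{
    rng_state = rng_state * 1103515245u + 12345u;    // simple LCG step, placeholder only
    return (int)((rng_state >> 16) & 0x7fff);
}

void myfunction()
{
    #pragma omp parallel                // without this, the tasks run one after another
    {
        rng_state = 12345u + (unsigned)omp_get_thread_num();   // per-thread seed

        #pragma omp single              // one thread creates the tasks...
        for (int i = 0; i < n; i++)
        {
            #pragma omp task            // ...and the whole team executes them
            {
                int r = custom_random_generator();   // touches only thread-local state
                compute(r);
            }
        }
    }
}

Whether this pays off still depends on how much work each task does; if one iteration is cheap, the task-creation cost alone can dominate, which is what the answer below argues.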

1 Answer


One theory could be the overhead that is associated with creating and maintaining multiple threads.

The advantages of parallel programming can only be seen when each iteration has to perform more complicated, processor-intensive work.

A simple for loop with a lightweight routine inside will not benefit from it.
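
One quick way to test this, along the lines of the thread-count experiment suggested in the comments on the question, is to time the same loop with different numbers of threads using omp_get_wtime(). This is only a sketch; n, custom_random_generator() and compute() are the question's own symbols:

#include <stdio.h>
#include <omp.h>

extern int n;                          // from the question's code
int custom_random_generator(void);     // from the question's code
void compute(int value);               // from the question's code

static void run_with(int threads)
{
    double start = omp_get_wtime();

    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < n; i++)
    {
        int r = custom_random_generator();
        compute(r);
    }

    printf("%d thread(s): %.3f s\n", threads, omp_get_wtime() - start);
}

// e.g. run_with(1); run_with(2); run_with(4);

If the 1-thread run already matches the serial time but 2 and 4 threads get slower (as reported in the comments on the question), the slowdown points less at the amount of work per iteration and more at the threads interfering with each other, e.g. through shared generator state or false sharing, as the comments suggest.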

anastaciu
  • my random generator is pretty complex but my compute function just returns the result of a simple equation, I think I will try to only parallelize it then – Sergio May 01 '21 at 09:29
  • Task failed successfully! It improved the time to 18 seconds, but that is still 2 times worse than the serial version, so I guess that confirms that task creation is adding overhead – Sergio May 01 '21 at 09:42
  • Something else is causing all this overhead, it cannot only be thread creation; I would suspect false sharing, cache invalidation and so on – dreamcrash May 01 '21 at 09:53
  • @Sergio great throwback, windows still has some gems though, [check the return for winAPI successful function calls](https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499-). – anastaciu May 01 '21 at 09:54
  • @dreamcrash, I also think that that would be too much, but it's hard to tell. I'll maybe add that to my answer. When I saw the openmp tag I thought you'd show up ;) – anastaciu May 01 '21 at 09:57
  • :D In all fairness I think the problem is that the OP's example does not accurately represent the code being parallelized, hence one can only speculate – dreamcrash May 01 '21 at 09:59