Proper / Efficient parallelization of a for loop with OpenMP

Question

I have developed a distributed memory MPI application which involves processing of a grid. Now i want to apply shared memory techniques (essentially making it a hybrid - parallel program), with OpenMP, to see if it can become any faster, or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application involves printing the grid to the screen every half a second, but when i parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:

void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
    //#pragma omp parallel
    updateScreen();
    int y, x;
    //#pragma omp parallel shared(grid)      // garbage
    //#pragma omp parallel private(y)        // garbage output!
    //#pragma omp for
    for (y = 0; y < nrows; y++) {
        //#pragma omp parallel shared(grid)  // nothing?
        //#pragma omp parallel private(x)    // 10 times slower!
        for (x = 0; x < ncolumns; x++) {
            printf("%c ", grid[y][x]);
        }
        printf("\n");
    }
    printf("Gen #%d\n", ngen);
    fflush(stdout);
}

(updateScreen() just clears the screen and writes from top left corner again.)

The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see i have tried many approaches and one is worse than the other. Best case, i get semi proper output every 2 seconds (because it refreshes very slowly). Worst case i get garbage output.

I would appreciate any help. Is there a place where i can find more information to proper parallelize loops with OpenMP? Thanks in advance.

Do I understand correctly ? You expect multiple threads to write (sort of) simultaneously to the console and to not get junk ? You expect multiple threads to contend for access to a single channel to write to the screen and for this to zip along merrily ? — High Performance Mark, Jul 30 '17 at 00:36
If you use a serial random number generator that also defeats parallelism. — tim18, Jul 30 '17 at 01:45

score 3 · Accepted Answer · answered Jul 30 '17 at 11:20

The function is executed by only one process, which makes it a perfect target for thread parallelization.

That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else, you're going to obtain a garbage result as your experienced (since the elements of your grid are going to be printed in an order that means nothing). Actually, your attempts at parallelizing were pretty good, the problem comes from the fact that the function itself is a bad target for parallelization.

Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.

When you try to parallelize some work that is intrinsically sequential, you lose more time synchronizing your workers (your openmp threads), than you gain by parallizing the work itself (which is why you obtain crap time when your result gets better).

Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.

An example :

Here is a very basic example of parallizing a program with openmp that achieves speedup. Here a dummy (yet heavy) computation is realized for each value of the i variable. The computations in each loop are completely independent, and the different threads can achieve their computations independently. The calls to printf can be achieved in whatever order since they are just informative.

Original (sequential.c)

#include <math.h>
#include <stdio.h>
#include <stdlib.h>


int main()
{
  int i,j;
  double x=0;

  for(i=0; i < 100; i++)
    {
      x = 100000 * fabs(cos(i*i));
      for(j=0;j<100+i*20000;j++)
        x += sqrt(sqrt(543*j)*fabs(sin(j)));
      printf("Computed i=%2d [%g]\n",i,x);
    }
}

Parallelized version (parallel.c)

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main()
{
  int i,j;
  double x=0;
#pragma omp parallel for
  for(i=0; i < 100; i++)
    {
      /* Dummy heavy computation  */
      x = 100000 * fabs(cos(i*i));
      #pragma omp parallel for reduction(+: x)
      for(j=0;j<100+i*20000;j++)
        x += sqrt(sqrt(543*j)*fabs(sin(j)));

      printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
    }
}

A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/

Proper / Efficient parallelization of a for loop with OpenMP

1 Answers1

An example :