
I'm modifying an existing library from single-threaded to multi-threaded. I have code like the example provided below. I can't understand how to declare the arrays x, y, array1 and array2. Which of them should I declare as shared or threadprivate? Do I need to use flush? If yes, in which case?

//global variables
static int array1[100000];
static int array2[100000]; 

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];

#pragma omp parallel for  
for(i=0; i<100; i++)
{
  y[i]  = i*i-3*i-10*random();
  x[i] = myfunc(i, y);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
  array1[array2[j]] += z[j]+j;
  return array1[j];
}
Yuriy Tigiev
  • Can you please clean up your code snippet a little bit? I just don't understand what is global, what this `for` loop does in the middle of nowhere, etc. And are you seriously using K&R-style function declarations? – Gilles Mar 02 '16 at 13:31
  • The K&R style functions were written by the owner of this library; I'm just trying to modify the source code. The provided code is an example. I would like to use it to understand the rules for choosing the right type of variables: shared, private or threadprivate. – Yuriy Tigiev Mar 02 '16 at 13:55
  • What or who is still using K&R C except ancient functions such as [div](https://github.com/Xilinx/eglibc/blob/master/stdlib/div.c)? – Z boson Mar 03 '16 at 08:24

1 Answer


The problem I see in your code is in this line:

array1[array2[j]] += z[j]+j;

This means that array1 can potentially be modified for any j index. And j, in the context of the function myfunc(), corresponds to the index i at the upper level. The trouble is that i is the index upon which the loop is parallelised; therefore, array1 can be modified concurrently at any moment by any thread.

The crucial question now is to know if array2 can have the same value for different indexes:

  • If you are sure that for any j1 != j2 you have array2[j1] != array2[j2], then your code is trivially parallelisable.
  • If there are values j1 != j2 for which you have array2[j1] == array2[j2], then you have dependencies across iterations for array1 and the code is no longer (simply and/or effectively) parallelisable, as illustrated in the sketch right after this list.
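To make the second case concrete, here is a purely illustrative sketch (the values and the setup_collision() helper are hypothetical, not taken from your code) of how a duplicate in array2 turns the parallel loop into a data race:

static int array1[100000];
static int array2[100000];

void setup_collision(void)   /* hypothetical helper, for illustration only */
{
    array2[3] = 5;
    array2[7] = 5;   /* same target index as iteration i==3 */
}

/* With the loop parallelised, iterations i==3 and i==7 may be executed by two
 * different threads at the same time, and both then perform
 *     array1[5] += ...;
 * concurrently: a data race, so the content of array1 (and the values returned
 * by myfunc()) becomes unpredictable.                                          */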

So let's assume we are in the former case; then the OpenMP directives you already have in the code are sufficient:

  • i needs to be private, but it implicitly already is since it is the index of the parallelised loop;
  • x and y should be shared (which they are by default) since their access index is the one that is distributed in parallel (namely i), so their parallel updates do not overlap;
  • array2 is only accessed in read mode, so it's a no-brainer shared (which it is by default again);
  • array1 is read and written, but due to our initial assumption, there are no possible collisions between threads since their sets of indexes for accessing it are disjoint. Therefore, the default shared qualifier works just fine (see the explicit version sketched right after this list).
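For reference, this is what the loop looks like with the data-sharing attributes spelled out explicitly instead of relying on the defaults; it is just a sketch, equivalent to the directive you already have (default(none) merely forces you to list every variable referenced lexically in the construct):

/* same loop, data-sharing attributes made explicit */
#pragma omp parallel for default(none) shared(x, y) private(i)
for(i=0; i<100; i++)
{
  y[i] = i*i-3*i-10*random();   /* but see the EDIT below about random() */
  x[i] = myfunc(i, y);          /* array1/array2 are file-scope statics: shared */
}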

But now, if we are in the case where array2 allows for non-disjoint sets of indexes for accessing array1, we will have to preserve the ordering of these accesses / updates of array1. This can be done with the ordered clause / directive. And since we still want the parallelisation to be (somewhat) effective, we will have to add a schedule(static,1) clause to the parallel directive. For more details about this, please refer to this great answer. Your code would now look like this:

//global variables
static int array1[100000];
static int array2[100000]; 

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];

#pragma omp parallel for schedule(static,1) ordered
for(i=0; i<100; i++)
{
  y[i] = i*i-3*i-10*random();
  x[i] = myfunc(i, y);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
  int tmp = z[j]+j;
  int result;
  #pragma omp ordered
  {
    /* keep the read of array1[j] inside the ordered region as well, otherwise
       another thread could be updating that element at the same time */
    array1[array2[j]] += tmp;
    result = array1[j];
  }
  return result;
}

This would (I think) work and, in terms of parallelism, not be too bad (for a limited number of threads), but it has a big (enormous) flaw: it generates tons of false sharing while updating x and y. Therefore, it might be more advantageous to use some per-thread copies of these and to only update the global arrays at the end. The central part of the code snippet would then look something like this (not tested at all):

/* needs <stdlib.h> for calloc()/free() and <omp.h> for the OpenMP calls */
int nbth;
#pragma omp parallel
#pragma omp single
nbth = omp_get_num_threads();

/* calloc() rather than malloc(): the per-thread copies must start at zero so
   that the final reduction can safely sum slots a thread never wrote to */
int *xm = calloc(1000000*nbth, sizeof(int));
int *ym = calloc(1000000*nbth, sizeof(int));
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int *xx = xm+1000000*tid;
    int *yy = ym+1000000*tid;
    #pragma omp for schedule(static,1) ordered
    for(i=0; i<100; i++)
    {
        yy[i] = i*i-3*i-10*random();
        xx[i] = myfunc(i, yy);
    }
    #pragma omp for
    for (i=0; i<100; i++)
    {
        int j;
        x[i] = 0;
        y[i] = 0;
        for (j=0; j<nbth; j++)
        {
            x[i] += xm[j*1000000+i];
            y[i] += ym[j*1000000+i];
        }
    }
}
free(xm);
free(ym);

This will avoid the false sharing, but will increase the number of memory accesses and the overhead of parallelisation. So it might not be very beneficial after all. You'll have to see it for yourself in your actual code.
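One simple way to see it for yourself is to time both versions with omp_get_wtime(). A minimal hypothetical harness (the time_loop() helper is mine, not part of your code) could look like this:

#include <stdio.h>
#include <omp.h>

/* run one version of the loop (wrapped in a function) and print the elapsed
   wall-clock time, so the two variants can be compared on real data sizes */
void time_loop(void (*loop_version)(void), const char *name)
{
    double t0 = omp_get_wtime();
    loop_version();
    double t1 = omp_get_wtime();
    printf("%s: %g s\n", name, t1 - t0);
}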

BTW, the fact that i only loops up to 100 looks suspicious to me when the corresponding arrays are declared to be 1000000 elements long. If 100 is truly the correct size for the loop, then the parallelisation probably isn't worth it anyway...


EDIT:

As Jim Cownie pointed out in a comment, I missed the call to random() as a source of dependency across iterations, preventing proper parallelisation. I'm not sure how relevant this is in the context of your actual code (I doubt you truly fill your y array with random data), but in case you do, you'll have to change this part in order to do it in parallel (otherwise, the serialisation needed to generate the random number series will just kill whatever gain comes from parallelisation). But generating non-correlated pseudo-random series in parallel is not as simple as it sounds. You can use rand_r() instead of random() as a thread-safe alternative for the RNG and initialise its seed per thread to different values. However, you cannot be sure that one thread's series won't collide with another thread's too soon (with a thread starting to generate the very same series as another one after a while, messing up your expected asymptotic behaviour).

As I'm pretty sure you're not truly interested in that, I won't develop it any further (this is a whole question all by itself) and I will just use the (not so good) rand_r() trick. If you want more details on a possible alternative for generating good parallel random series, just ask another question.

In the case where no problem comes from array2 (disjoint sets of indexes), the code would become:

// global variable
unsigned int seed;
#pragma omp threadprivate(seed)

// done just once somewhere
#pragma omp parallel
seed = omp_get_thread_num(); //or something else, but different for each thread

// then the parallelised loop
#pragma omp parallel for  
for(i=0; i<100; i++)
{
  y[i]  = i*i-3*i-10*rand_r(&seed);
  x[i] = myfunc(i, y);
}

Then the other case would have to use the same trick in addition to what has already been described (a combined sketch is given below). But again, keep in mind that this isn't good enough for serious RNG-based computation (like Monte Carlo methods). It does the job if all you want is to generate some values for testing purposes, but it won't pass any serious statistical quality test.
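For completeness, this is roughly how the two ingredients (the ordered construct from the first snippet and the per-thread seed) would combine in the non-disjoint case; same caveats as above, and again not tested:

// global variable, one private copy per thread
unsigned int seed;
#pragma omp threadprivate(seed)

// done just once somewhere
#pragma omp parallel
seed = omp_get_thread_num(); //or something else, but different for each thread

// then the parallelised loop, ordered as before
#pragma omp parallel for schedule(static,1) ordered
for(i=0; i<100; i++)
{
  y[i] = i*i-3*i-10*rand_r(&seed);
  x[i] = myfunc(i, y);  // myfunc() contains the "#pragma omp ordered" block shown earlier
}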

Gilles