
I'd like to get to know OpenMP a bit, because I'd like to have a huge loop parallelized. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).

  • The first test sums up result values (functions serial and parallel);
  • the second takes values from an input array and writes the processed values to an output array (functions serial_a and parallel_a).

My questions are:

  1. Why do the results of the first test differ, i.e. the results of serial and parallel?
  2. Surprisingly, the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads. In the example this is emulated by sin(a[i]) / cos(a[n-i]+1.0).
  3. Are there some easy rules for determining which variables to declare private, shared, and reduction?
  4. In both cases int i is declared outside the pragma, yet the second test appears to yield correct results. So is that okay, or does i have to be moved into the pragma omp parallel region, as suggested here?
  5. Any other hints on spotted mistakes?

Code

#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>

double serial(int x)
{
    double sum=0;
    int i;

    for(i = 0; i<x; i++){
        sum += sin(x*i) / cos(x*i+1.0);
    }
    return sum;
}

double parallel(int x)
{
    double sum=0;
    int i;

    #pragma omp parallel num_threads(6) shared(sum) //default(none) 
    {
        //printf("    I'm thread no. %d\n", omp_get_thread_num());

        #pragma omp for private(i, x) reduction(+: sum)
        for(i = 0; i<x; i++){
            sum += sin(x*i) / cos(x*i+1.0);
        }
    }
    return sum;
}

void serial_a(double* a, int n, double* y2)
{
    int i;

    for(i = 0; i<n; i++){
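        /* editor's note: for i==0 the term a[n-i] reads a[n], one element past the end of a */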
         y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
    }
}

void parallel_a(double* a, int n, double* y2)
{
    int i;

    #pragma omp parallel num_threads(6)
    {       
        #pragma omp for private(i)
        for(i = 0; i<n; i++){
            y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
        }
    }
}

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    double sum, *y1, *y2, *a, s, p;
    int x, n, *d;

    /* Check for proper number of arguments. */
    if(nrhs!=2) {
        mexErrMsgTxt("Two inputs required.");
    } else if(nlhs>2) {
        mexErrMsgTxt("Too many output arguments.");
    }
    /* Get pointer to first input */
    x = (int)mxGetScalar(prhs[0]);

    /* Get pointer to second input */
    a = mxGetPr(prhs[1]);
    d = (int*)mxGetDimensions(prhs[1]);
    n = (int)d[1]; // row vector

    /* Create space for output */
    plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
    plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);

    /* Get pointer to output array */
    y1 = mxGetPr(plhs[0]);
    y2 = mxGetPr(plhs[1]);

    {   /* Do the calculation */
        clock_t tic = clock();
        y1[0] = serial(x);
        s = (double) clock()-tic;
        printf("serial....: %.0f ms\n", s);
        mexEvalString("drawnow");

        tic = clock();
        y1[1] = parallel(x);
        p = (double) clock()-tic;
        printf("parallel..: %.0f ms\n", p);
        printf("ratio.....: %.2f \n", p/s);
        mexEvalString("drawnow");

        tic = clock();
        serial_a(a, n, y2);
        s = (double) clock()-tic;
        printf("serial_a..: %.0f ms\n", s);
        mexEvalString("drawnow");

        tic = clock();
        parallel_a(a, n, &y2[n]);
        p = (double) clock()-tic;
        printf("parallel_a: %.0f ms\n", p);
        printf("ratio.....: %.2f \n", p/s); 
    }
}

Output

>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21 
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21 
>> a(1) == a(2)

ans =

     0

>> all(b(:,1) == b(:,2))

ans =

     1
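Side note for readers: these timings presuppose that the MEX file was built with OpenMP enabled; with the Microsoft compiler the pragmas are otherwise silently ignored and both variants run serially. If the flag is not already set in the mexopts configuration, one way to pass it (assuming the MSVC toolchain listed below) is

mex omp1.c COMPFLAGS="$COMPFLAGS /openmp"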

System

MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
embert

1 Answer


In your function parallel you have a few mistakes. The reduction should be declared where you use parallel. Private and shared variables should also be declared there. But when you do a reduction you should not declare the variable that is being reduced as shared; the reduction clause takes care of this.

To know what to declare private or shared you have to ask yourself which variables are being written to. If a variable is not written to, then normally you want it to be shared. In your case the variable x does not change, so you should declare it shared; declaring it private, as your code does, gives every thread its own uninitialized copy of x, which is why your first test yields different results. The variable i, however, does change, so normally you should declare it private. To fix your function you could do

#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
    #pragma omp for 
    for(i = 0; i<x; i++){
        sum += sin(x*i) / cos(x*i+1.0);
    }
}

However, OpenMP automatically makes the iterator of a parallel for region private, and variables declared outside of parallel regions are shared by default, so for your parallel function you can simply do

#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
    sum += sin(x*i) / cos(x*i+1.0);
}

Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.

When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as

#pragma omp parallel for
for(i = 0; i<n; i++){
    y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}

and once again it is the same as your serial_a function except for the pragma statement.

But be careful about assuming iterators are private. Consider the following double loop:

for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
       //
    }
}

If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is declared outside the parallel region it is shared by default. In this case you need to explicitly declare j private, like this: #pragma omp parallel for private(j).
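Put together (a sketch reusing the placeholder loop from above), the corrected double loop reads

#pragma omp parallel for private(j)  /* i is made private automatically, j is not */
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
       //
    }
}

Alternatively, declaring the iterators inside the loops (C99 style, for(int j = 0; ...)) makes them private automatically.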

Z boson
  • *as long as each iteration of a parallel for loop acts on a different array element then you don't have to worry about shared and private* Okay. But what if the same elements of an array have to be read by different threads? Does OMP take care of this, e.g. does a thread simply wait in case an element is being read by another? Or is read access not an issue at all? And what about the case that an element possibly has to be changed by different threads? Is that possible at all? – embert Sep 26 '14 at 03:49
  • If the same elements of an array have to be read by different threads, you don't need to worry about this. Each thread will pull the data into its local cache. However, you do have to worry if more than one thread will write to the same array element. This can cause race conditions. You can also have problems writing to elements that are not the same but close together in memory. This is called false sharing. – Z boson Sep 26 '14 at 07:43
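To illustrate the last comment with a sketch (not part of the original answer; histogram, bin_idx and nbins are made-up names): when several threads may write to the same element, the update has to be synchronized, for example with an atomic construct.

#include <omp.h>

/* Hypothetical example: several threads may increment the same bin of a
   shared histogram, so each update is made atomic to avoid a race. */
void histogram(const int* bin_idx, int n, int* counts)
{
    int i;

    #pragma omp parallel for
    for(i = 0; i < n; i++){
        #pragma omp atomic
        counts[bin_idx[i]] += 1;  /* atomic read-modify-write */
    }
}

Atomic updates serialize the colliding writes, so under heavy contention it is often faster to give each thread a private copy of counts and merge them afterwards (OpenMP 4.5 and later can express this as an array-section reduction, reduction(+:counts[:nbins])).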