
I was trying to prove a point about OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to get high performance with OpenMP.

The Gauss-Seidel iteration is split into two separate sweeps (red and black), such that within each sweep every update can be performed in any order and there is no dependency between the tasks. So in theory no processor should ever have to wait for another one to perform any kind of synchronization.

The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower. For many other linear parallelized routines I obtain very good scaling, but this one is tricky.

My fear is that I am unable to "explain" to the compiler that the operations I perform on the array are thread-safe, so that it cannot really be effective.

See the example below.

Does anyone have a clue on how to make this more effective with OpenMP?

#include <vector>

void redBlackSmooth(std::vector<double> const & b,
                    std::vector<double> & x,
                    double h)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    int const N = static_cast<int>(x.size());
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    // Red sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 1; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }

    // Black sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 2; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }
}

Addition: I have now also tried a raw-pointer implementation, and it shows the same behavior as the STL container, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.

Smidstrup
  • How large exactly are the vectors involved? I'm surprised because you say you're comparing with MPICH, and this kind of linear-complexity task doesn't necessarily seem to benefit from distributed-memory parallelism. – us2012 Mar 30 '13 at 14:02
  • The size of the vectors is in the millions. In MPICH I am able to split each sweep into 4 separate chunks, execute these, and broadcast the results, and I get a decent speed-up. – Smidstrup Mar 30 '13 at 14:17
  • 2
    Sounds like cache-fighting to me. Have you tried to split your computational domain in two separate vectors and merge your data together at the end of the algorithm? – sbabbi Mar 30 '13 at 15:49
  • I have tried your suggestion, and it does improve things a bit. What I have done is that, instead of putting the #pragma parallel for in front of the inner for loop, I have added another for loop outside it and put the #pragma there instead (sketched below, after these comments). – Smidstrup Mar 30 '13 at 17:04
  • I'd vote up sbabbi's response if it were an answer. – duffymo Mar 30 '13 at 17:28
  • Try running the NAS Parallel Benchmarks; they have both MPI and OpenMP versions. In particular, try the LU benchmark, which uses a Gauss-Seidel solver. – arunmoezhi Mar 31 '13 at 00:29
  • I have found this guide, which suggests using firstprivate instead of shared. It gave something like a 20% improvement. http://docs.oracle.com/cd/E19059-01/stud.10/819-0501/7_tuning.html – Smidstrup Apr 01 '13 at 16:39
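
A minimal sketch of the restructuring described in the last two comments: the #pragma is moved to an added outer loop over chunks, and the loop constants are passed as firstprivate. The chunk size and the helper name redSweepChunked are illustrative assumptions, not taken from the post.

#include <algorithm>
#include <vector>

void redSweepChunked(std::vector<double> const & b,
                     std::vector<double> & x,
                     double h)
{
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    int const N = static_cast<int>(x.size());
    int const chunk = 4096;                       // assumed chunk size
    int const nChunks = (N + chunk - 1) / chunk;

    // Each thread now works on long contiguous blocks of x, so threads
    // rarely touch the same cache line.
    #pragma omp parallel for firstprivate(invh2, h2)
    for (int c = 0; c < nChunks; ++c)
    {
        int const lo = c * chunk;
        int const hi = std::min(lo + chunk, N - 1);
        int i = std::max(lo, 1);
        if (i % 2 == 0) ++i;                      // red sweep: odd indices only
        for (; i < hi; i += 2)
        {
            double const sigma = -invh2*(x[i-1] + x[i+1]);
            x[i] = (h2/2.0)*(b[i] - sigma);
        }
    }
}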

2 Answers


First of all, make sure that the x vector is aligned to cache boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (a Core Duo) if I force the memory alignment:

double * x;
const size_t CACHE_LINE_SIZE = 256;
// posix_memalign comes from <stdlib.h> (POSIX); release the buffer with free(x) when done.
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);

Second, you can try to assign more computation to each thread (this way you can keep cache lines separated), but I suspect that OpenMP already does something like this under the hood, so it may not gain much for large N.
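
As a side note, the chunking that OpenMP does by default can also be requested explicitly with the schedule clause; a minimal sketch on the red sweep from the question:

// schedule(static) gives each thread one contiguous block of iterations,
// so threads mostly touch disjoint cache lines.
#pragma omp parallel for schedule(static)
for (int i = 1; i < N-1; i += 2)
{
    double const sigma = -invh2*(x[i-1] + x[i+1]);
    x[i] = (h2/2.0)*(b[i] - sigma);
}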

In my case the following block-wise implementation is much faster when x is not cache-aligned:

const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); // Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base, k, i;

// redSweep selects the red (odd-index) or black (even-index) half sweep.
#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
    base = j * workGroupSize;
    for (k = 0; k < workGroupSize; k+=2)
    {
        i = base + k + (redSweep ? 1 : 0);
        if ( i == 0 || i+1 == N ) continue;   // skip the fixed boundary points
        sigma = -invh2* ( x[i-1] + x[i+1] );
        x[i] = ( h2/2.0 ) * ( b[i] - sigma );
    }
}

In conclusion, you definitely have a cache-fighting problem; given the way OpenMP works (sadly I am not that familiar with it), working with properly aligned buffers should be enough.

sbabbi
  • I have tried aligning the memory according to your suggestion, and it improves the speed very significantly. However, I am still not able to get decent scaling with more than 2 threads. – Smidstrup Apr 01 '13 at 12:14
  • However, doing it block-wise as you suggest, or using a simple loop, gives the same performance once the memory has been aligned. – Smidstrup Apr 01 '13 at 15:33
  • Unfortunately I only have access to a dual-core machine at the moment, so I cannot test your code with more than 2 threads :(. – sbabbi Apr 02 '13 at 17:36

I think the main problem is the type of array structure you are using. Let's compare results with vectors and arrays (arrays = C arrays allocated with the new operator).

Vector and array sizes are N = 10000000. I force the smoothing function to repeat so that the runtime stays above 0.1 seconds.

Vector Time:    0.121007        Repeat: 1       MLUPS:  82.6399

Array Time:     0.164009        Repeat: 2       MLUPS:  121.945

MLUPS = ((N-2)*repeat/runtime)/1000000 (million lattice updates per second)
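
For the array run above, for example: ((10000000 - 2) * 2 / 0.164009) / 1000000 ≈ 121.9 MLUPS, which matches the figure reported above.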

MFLOPS is misleading when it comes to grid calculations: a few changes in the basic equation can suggest higher performance for the same runtime.

The modified code:

// timing() and dummy() come from the benchmark harness (possible stand-ins are
// sketched after this answer); cout/endl require <iostream> and std:: (or using namespace std).
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    double runtime(0.0), wcs, wce;
    int repeat = 1;
    timing(&wcs);
    for(; runtime < 0.1; repeat*=2)
    {
        for(int r = 0; r < repeat; ++r)
        {
            // Red sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 1; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }
            // Black sweep.
            #pragma omp parallel for shared(b, x) private(sigma)
            for (int i = 2; i < N-1; i+=2)
            {
                sigma = -invh2*(x[i-1] + x[i+1]);
                x[i] = (h2*0.5)*(b[i] - sigma);
            }
            //  cout << "In Array:      " << r << endl;
        }
        if(x[0] != 0) dummy(x[0]);
        timing(&wce);
        runtime = (wce-wcs);
    }
    //  cout << "Before division:   " << repeat << endl;
    repeat /= 2;
    cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
         << "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;

    return runtime;
}

I didn't change anything in the code except the array type. For better cache access and blocking you should look into data alignment (_mm_malloc).
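
Since the harness helpers are not shown in the answer, here is a minimal sketch of plausible stand-ins and a driver that allocates the arrays with new; the helper bodies, omp_get_wtime, and the initialization values are my assumptions, not the original code. When compiling everything in one file, declare timing() and dummy() before my_redBlackSmooth.

#include <omp.h>

double my_redBlackSmooth(double *b, double *x, double h, int N); // defined above

// Hypothetical stand-in: record a wall-clock timestamp in seconds.
void timing(double *wc)
{
    *wc = omp_get_wtime();
}

// Hypothetical stand-in: opaque sink that keeps the compiler from
// optimizing the sweeps away.
void dummy(double v)
{
    static volatile double sink;
    sink = v;
}

int main()
{
    int const N = 10000000;
    double const h = 1.0 / (N - 1);

    // "Arrays = C arrays allocated with the new operator", as in the answer.
    double *b = new double[N];
    double *x = new double[N];
    for (int i = 0; i < N; ++i) { b[i] = 1.0; x[i] = 0.0; }

    my_redBlackSmooth(b, x, h, N);

    delete[] b;
    delete[] x;
    return 0;
}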

DOOM