8

I tried to write this code

float* theArray; // the array to find the minimum value
int   index, i;
float thisValue, min;

index = 0;
min = theArray[0];
#pragma omp parallel for reduction(min:min_dist)
for (i=1; i<size; i++) {
    thisValue = theArray[i];
    if (thisValue < min) { /* find the min and its array index */
        min   = thisValue;
        index = i;
    }
}
return(index);

However, this does not output correct answers. The min seems to be OK, but the index is being clobbered by the threads.

I also tried some approaches suggested on the Internet and here (using parallel for on the outer loop and a critical section for the final comparison), but this caused a slowdown rather than a speedup.

What should I do to make both the min value and its index correct? Thanks!

xxbidiao
  • Isn't it bad to share variables (thisValue, min) between threads without a mutex? – Neil Kirk Feb 01 '15 at 01:50
  • Seems like this reduction does a lot of things, so I'm not so clear whether it is shared or not. I have tried a version without reduction, where the threads run freely and the assignment part is in a critical section, and the running speed actually goes down. – xxbidiao Feb 01 '15 at 06:17

4 Answers

18

I don't know of an elegant way to do a minimum reduction and save the index at the same time. I do this by finding the local minimum and its index in each thread and then the global minimum and index in a critical section.

index = 0;
min = theArray[0];
#pragma omp parallel
{
    int index_local = index;
    float min_local = min;
    // each thread scans its share of the iterations using private copies
    #pragma omp for nowait
    for (i = 1; i < size; i++) {
        if (theArray[i] < min_local) {
            min_local = theArray[i];
            index_local = i;
        }
    }
    // merge the per-thread results; only this small part is serialized
    #pragma omp critical
    {
        if (min_local < min) {
            min = min_local;
            index = index_local;
        }
    }
}

With OpenMP 4.0 it's possible to use user-defined reductions. A user-defined minimum reduction can be defined like this:

struct Compare { float val; size_t index; };
/* the initializer clause makes each thread's private copy start from the original value */
#pragma omp declare reduction(minimum : struct Compare : omp_out = omp_in.val < omp_out.val ? omp_in : omp_out) initializer(omp_priv = omp_orig)

Then the reduction can be done like this

struct Compare min; 
min.val = theArray[0]; 
min.index = 0;
#pragma omp parallel for reduction(minimum:min)
for(int i = 1; i<size; i++) {
    if(theArray[i]<min.val) { 
        min.val = theArray[i];
        min.index = i;
    }
}

That works for C and C++. User-defined reductions have other advantages besides simplified code. There are multiple algorithms for doing reductions: for example, the merging can be done in O(number of threads) or O(log(number of threads)) steps. The first solution I gave merges in O(number of threads), whereas using a user-defined reduction lets OpenMP choose the algorithm.
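
For illustration, a merge in O(log(number of threads)) could look roughly like the following sketch. It is only a sketch: it assumes the per-thread partial results have already been gathered into an array, one struct Compare per thread, and the names partial and tree_merge are introduced here just for illustration.

#include <stddef.h>

struct Compare { float val; size_t index; };

/* Pairwise (tree) merge of per-thread partial results in O(log(nthreads))
   parallel steps; partial[0] ends up holding the global minimum and index. */
void tree_merge(struct Compare *partial, int nthreads)
{
    for (int stride = 1; stride < nthreads; stride *= 2) {
        #pragma omp parallel for
        for (int t = 0; t < nthreads; t += 2 * stride) {
            if (t + stride < nthreads && partial[t + stride].val < partial[t].val)
                partial[t] = partial[t + stride];
        }
    }
}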

Z boson
  • I'm not sure why, but I'm getting incorrect results with this code. – Avi Ginsburg Feb 02 '15 at 11:25
  • Got it. `theArray[i] < min` should be `theArray[i] < min_local`. +1 for being faster than the optimized non-OMP version. – Avi Ginsburg Feb 02 '15 at 11:40
  • I'm also getting incorrect results with this code. Did you fix that, Avi? – xxbidiao Feb 02 '15 at 16:57
  • @xxbidiao As of when you posted your comment, the code was corrected. – Avi Ginsburg Feb 03 '15 at 06:25
  • Very nice edit. Clean and clear. Now, to benchmark... (not today...). – Avi Ginsburg Feb 04 '15 at 13:16
  • @AviGinsburg, thanks. I don't think it will make a difference in performance. Although OpenMP could merge in O(log(nthreads)), I think in practice O(nthreads) wins because the overhead of merging in O(log(nthreads)) is much higher. But it would be interesting to find a case where it makes a difference. Maybe on the GPU OpenMP will merge in O(log(nthreads))... – Z boson Feb 04 '15 at 13:23
4

Basic Idea

This can be accomplished without any parallelization-breaking critical or atomic sections by creating a custom reduction. Basically, define an object that stores both the index and the value, and then create a function that compares two of these objects by the value only, not the index.

Details

An object to store an index and value together:

typedef std::pair<unsigned int, float> IndexValuePair;

You can access the index by accessing the first property and the value by accessing the second property, i.e.,

IndexValuePair obj(0, 2.345);
unsigned int ix = obj.first;  // 0
float val = obj.second; // 2.345

Define a function that returns the smaller of two IndexValuePair objects, comparing only the values:

IndexValuePair myMin(IndexValuePair a, IndexValuePair b){
    return a.second < b.second ? a : b;
}

Then, construct a custom reduction following the guidelines in the OpenMP documentation:

#pragma omp declare reduction \
(minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
initializer(omp_priv = IndexValuePair(0, 1000))

In this case, I've chosen to initialize the index to 0 and the value to 1000. The value should be initialized to a number larger than any value you expect in the data.
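
If you don't want to guess at a safe sentinel, a rough alternative is to seed the private copies with the largest representable float instead; a minimal sketch of the same reduction with that initializer:

#include <limits>
#include <utility>

typedef std::pair<unsigned int, float> IndexValuePair;

IndexValuePair myMin(IndexValuePair a, IndexValuePair b){
    return a.second < b.second ? a : b;
}

// Same reduction as above, but each private copy starts at the largest
// finite float, so no element of the input can be missed.
#pragma omp declare reduction \
    (minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
    initializer(omp_priv = IndexValuePair(0, std::numeric_limits<float>::max()))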

Functional Example

Finally, combine all these pieces with the parallel for loop!

// Compile with g++ -std=c++11 -fopenmp demo.cpp
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <utility>
#include <vector>

typedef std::pair<unsigned int, float> IndexValuePair;

IndexValuePair myMin(IndexValuePair a, IndexValuePair b){
    return a.second < b.second ? a : b;
}

int main(){

    std::vector<float> vals {10, 4, 6, 2, 8, 0, -1, 2, 3, 4, 4, 8};
    unsigned int i;

    IndexValuePair minValueIndex(0, 1000);

    #pragma omp declare reduction \
        (minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
        initializer(omp_priv = IndexValuePair(0, 1000))

    #pragma omp parallel for reduction(minPair:minValueIndex)
    for(i = 0; i < vals.size(); i++){

        if(vals[i] < minValueIndex.second){
            minValueIndex.first = i;
            minValueIndex.second = vals[i];
        }
    }

    std::cout << "minimum value = " << minValueIndex.second << std::endl;   // Should be -1
    std::cout << "index = " << minValueIndex.first << std::endl;    // Should be 6


    return EXIT_SUCCESS;

}
AndrewCox
2

Because you're not only trying to find the minimal value (reduction(min:___)) but also to retain its index, you need to make the check critical. This can significantly slow down the loop (as reported). In general, make sure there is enough work per thread so that the overhead doesn't dominate, as in this question. An alternative is to have each thread find its own minimum and its index, save them to per-thread variables, and have the master thread do a final pass over those, as in the following program.

#include <iostream>
#include <vector>
#include <ctime>
#include <random>
#include <limits>
#include <omp.h>

using std::cout;
using std::vector;

void initializeVector(vector<double>& v)
{
    std::mt19937 generator(time(NULL));
    std::uniform_real_distribution<double> dis(0.0, 1.0);
    v.resize(100000000);
    for(int i = 0; i < v.size(); i++)
    {
        v[i] = dis(generator);
    }
}

int main()
{
    vector<double> vec;
    initializeVector(vec);

    float minVal = vec[0];
    int minInd = 0;

    int startTime = clock();

    for(int i = 1; i < vec.size(); i++)
    {
        if(vec[i] < minVal)
        {
            minVal = vec[i];
            minInd = i;
        }

    }

    int elapsedTime1 = clock() - startTime;

    // Change the number of threads accordingly
    vector<float> threadRes(4, std::numeric_limits<float>::max());
    vector<int>   threadInd(4);

    startTime = clock();
#pragma omp parallel for
    for(int i = 0; i < vec.size(); i++)
    {
        if(vec[i] < threadRes[omp_get_thread_num()])
        {
            threadRes[omp_get_thread_num()] = vec[i];
            threadInd[omp_get_thread_num()] = i;
        }
    }

    float minVal2 = threadRes[0];
    int minInd2 = threadInd[0];

    for(int i = 1; i < threadRes.size(); i++)
    {
        if(threadRes[i] < minVal2)
        {
            minVal2 = threadRes[i];
            minInd2 = threadInd[i];
        }
    }

    int elapsedTime2 = clock() - startTime;

    cout << "Min " << minVal << " at " << minInd << " took " << elapsedTime1 << std::endl;
    cout << "Min " << minVal2 << " at " << minInd2 << " took " << elapsedTime2 << std::endl;
}

Please note that with optimizations on and nothing else to do in the loop, the serial version seems to remain king. With optimizations turned off, OpenMP gains the upper hand.
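
As the comments below point out, threadRes and threadInd are also prone to false sharing (adjacent elements written by different threads share a cache line), and clock() can report CPU time summed over all threads. Here is a rough sketch of one way to deal with both, padding each thread's slot to a cache line and timing with omp_get_wtime(); the 64-byte line size and the helper names are assumptions for illustration:

#include <omp.h>
#include <vector>
#include <limits>

// one result slot per thread, padded to a (typical) 64-byte cache line so
// neighbouring threads don't invalidate each other's lines on every update
struct PaddedResult {
    float val;
    int   ind;
    char  pad[64 - sizeof(float) - sizeof(int)];
};

int minIndexPadded(const std::vector<double>& vec, double& seconds)
{
    std::vector<PaddedResult> partial(omp_get_max_threads());
    for (auto& r : partial) { r.val = std::numeric_limits<float>::max(); r.ind = 0; }

    double t0 = omp_get_wtime();          // wall-clock time, safe with threads

    #pragma omp parallel for
    for (int i = 0; i < (int)vec.size(); i++) {
        PaddedResult& r = partial[omp_get_thread_num()];
        if (vec[i] < r.val) { r.val = (float)vec[i]; r.ind = i; }
    }

    // serial merge over the few per-thread partial results
    int best = 0;
    for (int t = 1; t < (int)partial.size(); t++)
        if (partial[t].val < partial[best].val) best = t;

    seconds = omp_get_wtime() - t0;
    return partial[best].ind;
}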

P.S. You wrote reduction(min:min_dist) and then proceeded to use min instead of min_dist.

Avi Ginsburg
  • false sharing and `clock` – Z boson Feb 02 '15 at 11:12
  • @Zboson I'll admit the false sharing point, but `clock` is good enough as the vector size is large enough. If it makes that much of a difference, I'll change it to `chrono::high_resolution_clock`. – Avi Ginsburg Feb 02 '15 at 12:40
  • [Don't use `clock()` for timing multi-threaded code except with MSVC](https://stackoverflow.com/questions/10874214/measure-execution-time-in-c-openmp-code). Use `omp_get_wtime()` as it's reliable with MSVC, GCC, and ICC. – Z boson Feb 02 '15 at 13:27
  • @Zboson Thanks, I've never had that issue but will keep it in mind. – Avi Ginsburg Feb 02 '15 at 14:11
  • I think there is a way to do this using the reduction construct with OpenMP4.0 by creating custom reductions and operation overloading with C++. But you're using MSVC? It only supports OpenMP 2.0. – Z boson Feb 03 '15 at 09:41
  • I updated my answer using a user-defined reduction from OpenMP 4.0. I just figured out how to get this working. I think it's a much better solution. – Z boson Feb 04 '15 at 13:05
  • I'm on GCC. I'll try it out. Thanks! – xxbidiao Feb 04 '15 at 22:11
0

Actually, we can use the omp critical directive so that only one thread at a time runs the code inside the critical region. That way only one thread can update the shared variables, and the index value won't be destroyed by other threads.

About the omp critical directive:

The omp critical directive identifies a section of code that must be executed by a single thread at a time.

This code solves your issue:

#include <stdio.h>
#include <omp.h>

int main() {
    int i;
    int arr[10] = {11, 42, 53, 64, 55, 46, 47, 68, 59, 510};

    int   index = 0;
    float thisValue, min = arr[0];
    int   size = 10;

    #pragma omp parallel for
    for (i = 1; i < size; i++) {
        thisValue = arr[i];
        #pragma omp critical
        if (thisValue < min) { /* find the min and its array index */
            min   = thisValue;
            index = i;
        }
    }
    printf("min:%f index:%d", min, index);
    return 0;
}
  • I'd like to bring to your attention that 1) both your code and the OP's code contain a data race/data races. In your case, this can easily be solved by using, for example, `const float thisValue = arr[i];` or just by removing this variable. 2) The bigger issue is that your code runs sequentially (and has parallel overhead), making it slower than the original serial code. – Laci Feb 10 '23 at 08:34