Is there a faster alternative for computing the argmin in OpenACC than splitting the work into a minimum-reduction loop and a second loop that finds the index of the minimum?

This looks very wasteful:

    float minVal = std::numeric_limits<float>::max();
    #pragma acc parallel loop reduction(min: minVal)
    for(int i = 0; i < arraySize; ++i) {
        minVal = fmin(minVal, array[i]);
    }

    int minIndex = -1;
    #pragma acc parallel loop
    for(int i = 0; i < arraySize; ++i) {
        if(array[i] == minVal){
            minIndex = i;
        }
    }

In fact, this became a bottleneck for my current project.

Dunkelkoon
  • Please note that your code contains a race condition if `minVal` appears multiple times in the array. – Jérôme Richard Jun 10 '21 at 21:27
  • @JérômeRichard True, but would it matter in an application? In other words, should there be a `#pragma acc atomic write` above the `minIndex = i`? Or do you mean that the result is uncertain with multiple equal `minVal`s? The latter should not matter in most applications, afaik. – Dunkelkoon Jun 11 '21 at 09:53
  • First of all, the result will likely be non-deterministic, though that may not be a problem in your application. That being said, yes, I think putting at least an atomic write is important to avoid hardware-related weird effects due to the race condition (see the sketch right after these comments). Indeed, while it should not be a critical issue on most GPUs, nothing prevents some GPUs from writing to `minIndex` non-atomically, resulting in wrong results. Note that I think almost all mainstream modern GPUs write 4-byte values atomically, so in practice such an effect should not appear. You can get deterministic results with an atomic min/max. – Jérôme Richard Jun 11 '21 at 10:13
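For illustration, a minimal sketch of the second loop with the atomic write discussed above (same variables as in the question; whether it is strictly needed depends on the target hardware, as noted in the comments):

    #pragma acc parallel loop
    for(int i = 0; i < arraySize; ++i) {
        if(array[i] == minVal) {
            #pragma acc atomic write
            minIndex = i;
        }
    }

The result is still non-deterministic when the minimum occurs more than once, but each store to `minIndex` is now performed atomically.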

1 Answer


We've gotten requests for minloc/maxloc, but it's difficult and would most likely not be performant, so it's not something that's been added. The method you're using is the recommended solution for this.

Mat Colgrove
  • Here the computation will likely be *memory bound* on most GPUs. Even a pretty bad implementation of minloc/maxloc would be faster on many GPUs, as long as the implementation could scale. One can search for a minloc/maxloc per block and then use an atomic CAS to perform the reduction between the blocks (a sketch of this approach follows these comments). Alternatively, one can perform a two-pass reduction for old/featureless GPUs. The resulting implementation can be up to 2 times faster here (and will likely be on many modern GPUs). – Jérôme Richard Jun 10 '21 at 21:40
  • @JérômeRichard Could you outline your suggestion with some code in an answer? This would be a tremendous help for me. – Dunkelkoon Jun 11 '21 at 09:49
  • @Dunkelkoon Note that I am talking about a possible backend (partial) implementation. Thus, not OpenACC code, but rather something like CUDA code (I guess the same thing could be achieved with OpenCL, but I am not very familiar with it). Is that ok for you? – Jérôme Richard Jun 11 '21 at 09:58
  • @JérômeRichard Ah, my bad. Thanks for clearing that up. – Dunkelkoon Jun 11 '21 at 10:12
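For illustration, here is a minimal CUDA sketch of the per-block minloc plus atomic-CAS reduction described in the comments above. The kernel name, block size, test data, and the packing of (value, index) into one 64-bit word are choices made for this sketch (not part of any OpenACC backend), and it assumes a `float` array indexed by `int`, as in the question:

    // argmin_sketch.cu -- build with: nvcc -O3 argmin_sketch.cu
    #include <cfloat>
    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    #define BLOCK 256

    // Pack a float's bit pattern and an index into one 64-bit word so a single
    // atomicCAS can update both together. The packed bits are never compared
    // directly; values are always unpacked and compared as floats.
    __device__ unsigned long long pack(float v, int i) {
        return (static_cast<unsigned long long>(__float_as_uint(v)) << 32)
             | static_cast<unsigned int>(i);
    }
    __device__ float unpackVal(unsigned long long p) {
        return __uint_as_float(static_cast<unsigned int>(p >> 32));
    }

    __global__ void argminKernel(const float *data, int n, unsigned long long *result)
    {
        __shared__ float svals[BLOCK];
        __shared__ int   sidxs[BLOCK];

        int tid = threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        // Grid-stride loop: each thread folds several elements into a private minloc.
        float bestVal = FLT_MAX;
        int   bestIdx = -1;
        for (int i = blockIdx.x * blockDim.x + tid; i < n; i += stride) {
            if (data[i] < bestVal) { bestVal = data[i]; bestIdx = i; }
        }
        svals[tid] = bestVal;
        sidxs[tid] = bestIdx;
        __syncthreads();

        // Tree reduction within the block (BLOCK must be a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s && svals[tid + s] < svals[tid]) {
                svals[tid] = svals[tid + s];
                sidxs[tid] = sidxs[tid + s];
            }
            __syncthreads();
        }

        // One thread per block merges its minloc into the global result via CAS.
        if (tid == 0) {
            unsigned long long desired = pack(svals[0], sidxs[0]);
            unsigned long long old = *result;
            while (svals[0] < unpackVal(old)) {
                unsigned long long prev = atomicCAS(result, old, desired);
                if (prev == old) break;   // our block's minimum was stored
                old = prev;               // another block won; re-check against its value
            }
        }
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        unsigned long long *d_result;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMalloc(&d_result, sizeof(unsigned long long));

        // Fill the array with some test data (smallest value placed at index 1234).
        float *h_data = new float[n];
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f + (i % 1000);
        h_data[1234] = -5.0f;
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // Initialise the packed result to (FLT_MAX, -1) using the same bit layout.
        unsigned int bits; float big = FLT_MAX;
        std::memcpy(&bits, &big, sizeof(bits));
        unsigned long long init =
            (static_cast<unsigned long long>(bits) << 32) | 0xFFFFFFFFull;
        cudaMemcpy(d_result, &init, sizeof(init), cudaMemcpyHostToDevice);

        argminKernel<<<1024, BLOCK>>>(d_data, n, d_result);

        unsigned long long packed;
        cudaMemcpy(&packed, d_result, sizeof(packed), cudaMemcpyDeviceToHost);
        int minIndex = static_cast<int>(packed & 0xFFFFFFFFull);
        printf("argmin index = %d\n", minIndex);

        delete[] h_data;
        cudaFree(d_data);
        cudaFree(d_result);
        return 0;
    }

Unlike the two loops in the question, this reads the array only once, which is what makes it attractive when the computation is memory bound.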