Is there a faster alternative for computing the argmin in OpenACC than splitting the work into a minimum-reduction loop and a second loop that finds the index of the minimum?

This looks very wasteful:

    float minVal = std::numeric_limits<float>::max();
    #pragma acc parallel loop reduction(min: minVal)
    for(int i = 0; i < arraySize; ++i) {
        minVal = fmin(minVal, array[i]);
    }

    int minIndex = -1;
    #pragma acc parallel loop
    for(int i = 0; i < arraySize; ++i) {
        if(array[i] == minVal){
            minIndex = i;
        }
    }

In fact, this became a bottleneck for my current project.

Dunkelkoon
  • Please note that your code contains a race condition if `minVal` appears multiple times in the array. – Jérôme Richard Jun 10 '21 at 21:27
  • @JérômeRichard True, but would it matter in an application? In other words, should there be a `#pragma acc atomic write` above the `minIndex = i`? Or do you mean that the result is uncertain with multiple equal `minVal`s? The latter should not matter in most applications, afaik. – Dunkelkoon Jun 11 '21 at 09:53
  • First of all, the result will likely be non-deterministic, though that may not be a problem in your application. That being said, yes, I think putting at least an atomic write is important to avoid hardware-related weird effects due to the race condition (see the sketch right after these comments). Indeed, while it should not be a critical issue on most GPUs, nothing prevents some GPUs from writing to `minIndex` non-atomically, resulting in wrong results. Note that I think almost all mainstream modern GPUs write 4-byte values atomically, so in practice such an effect should not appear. You can get deterministic results with an atomic min/max. – Jérôme Richard Jun 11 '21 at 10:13
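For illustration, a minimal sketch of the second loop with the atomic write discussed above (same variables as in the question; whether it is strictly needed depends on the target hardware, as noted in the comments):

    #pragma acc parallel loop
    for(int i = 0; i < arraySize; ++i) {
        if(array[i] == minVal) {
            #pragma acc atomic write
            minIndex = i;
        }
    }

The result is still non-deterministic when the minimum occurs more than once, but each store to `minIndex` is now performed atomically.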

1 Answer


We've gotten requests for minloc/maxloc, but it's difficult and would most likely not be performant, so it's not something that's been added. The method you're using is the recommended solution for this.

Mat Colgrove
  • Here the computation will likely be *memory bound* on most GPUs. Even a pretty bad implementation of minloc/maxloc would be faster on many GPUs, as long as the implementation could scale. One can search for a minloc/maxloc per block and then use an atomic CAS to perform the reduction between the blocks (a sketch of this approach follows these comments). Alternatively, one can perform a two-pass reduction for old/featureless GPUs. The resulting implementation can be up to 2 times faster here (and will likely be on many modern GPUs). – Jérôme Richard Jun 10 '21 at 21:40
  • @JérômeRichard Could you outline your suggestion with some code in an answer? This would be a tremendous help for me. – Dunkelkoon Jun 11 '21 at 09:49
  • @Dunkelkoon Note that I am talking about a possible backend (partial) implementation. Thus, not OpenACC code, but rather something like CUDA code (I guess the same thing could be achieved with OpenCL, but I am not very familiar with it). Is that ok for you? – Jérôme Richard Jun 11 '21 at 09:58
  • @JérômeRichard Ah, my bad. Thanks for clearing that up. – Dunkelkoon Jun 11 '21 at 10:12
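For illustration, here is a minimal CUDA sketch of the per-block minloc plus atomic-CAS reduction described in the comments above. The kernel name, block size, test data, and the packing of (value, index) into one 64-bit word are choices made for this sketch (not part of any OpenACC backend), and it assumes a `float` array indexed by `int`, as in the question:

    // argmin_sketch.cu -- build with: nvcc -O3 argmin_sketch.cu
    #include <cfloat>
    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    #define BLOCK 256

    // Pack a float's bit pattern and an index into one 64-bit word so a single
    // atomicCAS can update both together. The packed bits are never compared
    // directly; values are always unpacked and compared as floats.
    __device__ unsigned long long pack(float v, int i) {
        return (static_cast<unsigned long long>(__float_as_uint(v)) << 32)
             | static_cast<unsigned int>(i);
    }
    __device__ float unpackVal(unsigned long long p) {
        return __uint_as_float(static_cast<unsigned int>(p >> 32));
    }

    __global__ void argminKernel(const float *data, int n, unsigned long long *result)
    {
        __shared__ float svals[BLOCK];
        __shared__ int   sidxs[BLOCK];

        int tid = threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        // Grid-stride loop: each thread folds several elements into a private minloc.
        float bestVal = FLT_MAX;
        int   bestIdx = -1;
        for (int i = blockIdx.x * blockDim.x + tid; i < n; i += stride) {
            if (data[i] < bestVal) { bestVal = data[i]; bestIdx = i; }
        }
        svals[tid] = bestVal;
        sidxs[tid] = bestIdx;
        __syncthreads();

        // Tree reduction within the block (BLOCK must be a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s && svals[tid + s] < svals[tid]) {
                svals[tid] = svals[tid + s];
                sidxs[tid] = sidxs[tid + s];
            }
            __syncthreads();
        }

        // One thread per block merges its minloc into the global result via CAS.
        if (tid == 0) {
            unsigned long long desired = pack(svals[0], sidxs[0]);
            unsigned long long old = *result;
            while (svals[0] < unpackVal(old)) {
                unsigned long long prev = atomicCAS(result, old, desired);
                if (prev == old) break;   // our block's minimum was stored
                old = prev;               // another block won; re-check against its value
            }
        }
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        unsigned long long *d_result;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMalloc(&d_result, sizeof(unsigned long long));

        // Fill the array with some test data (smallest value placed at index 1234).
        float *h_data = new float[n];
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f + (i % 1000);
        h_data[1234] = -5.0f;
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // Initialise the packed result to (FLT_MAX, -1) using the same bit layout.
        unsigned int bits; float big = FLT_MAX;
        std::memcpy(&bits, &big, sizeof(bits));
        unsigned long long init =
            (static_cast<unsigned long long>(bits) << 32) | 0xFFFFFFFFull;
        cudaMemcpy(d_result, &init, sizeof(init), cudaMemcpyHostToDevice);

        argminKernel<<<1024, BLOCK>>>(d_data, n, d_result);

        unsigned long long packed;
        cudaMemcpy(&packed, d_result, sizeof(packed), cudaMemcpyDeviceToHost);
        int minIndex = static_cast<int>(packed & 0xFFFFFFFFull);
        printf("argmin index = %d\n", minIndex);

        delete[] h_data;
        cudaFree(d_data);
        cudaFree(d_result);
        return 0;
    }

Unlike the two loops in the question, this reads the array only once, which is what makes it attractive when the computation is memory bound.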