I'm a student learning CUDA and I'm trying to write a counting sort kernel:
__global__ void kernelCountingSort(int *array, int dim_array, int *counts) {
    // global thread index: one thread per element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int count = 0;

    // check that the thread is not outside the array boundary
    if (i >= dim_array) return;

    // count how many elements must precede array[i] in the sorted output;
    // ties are broken by index so equal values get distinct positions
    for (int j = 0; j < dim_array; ++j) {
        if (array[j] < array[i])
            count++;
        else if (array[j] == array[i] && j < i)
            count++;
    }

    // count is the final rank of array[i], so write the value to that slot
    counts[count] = array[i];
}
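For context, this is roughly how I launch and time the kernel on the host side (a sketch: the array size N and the way the input gets filled are placeholders for my real test setup):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;              // placeholder array size
    int *d_array, *d_counts;
    cudaMalloc(&d_array, N * sizeof(int));
    cudaMalloc(&d_counts, N * sizeof(int));
    // ... copy the input data into d_array ...

    int blockSize = 32;                 // the value I vary (32, 64, 128, ...)
    int gridSize = (N + blockSize - 1) / blockSize;

    // time the kernel with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernelCountingSort<<<gridSize, blockSize>>>(d_array, N, d_counts);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block size %d: %.3f ms\n", blockSize, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_array);
    cudaFree(d_counts);
    return 0;
}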
I tried to analyze my algorithm's performance with increasing block size; here is the timing graph with the corresponding block sizes:
With a block size of 64 I get 100% occupancy, yet I achieve the best performance, i.e. the minimum execution time, with a block size of 32. Is it possible to get better performance with lower occupancy?
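For reference, this is roughly how I compute the theoretical occupancy for a given block size, using the CUDA runtime occupancy API (a sketch; device 0 is just my setup, and the kernel is assumed to be in the same file):

#include <cstdio>
#include <cuda_runtime.h>

void reportOccupancy(int blockSize) {
    // how many blocks of this size can be resident on one SM for this kernel
    int maxActiveBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks,
                                                  kernelCountingSort,
                                                  blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 is an assumption

    // resident threads per SM divided by the hardware maximum
    float occupancy = (float)(maxActiveBlocks * blockSize)
                    / prop.maxThreadsPerMultiProcessor;
    printf("block size %d -> theoretical occupancy %.0f%%\n",
           blockSize, occupancy * 100.0f);
}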