
How can I parallelize this task on CUDA, and is it possible at all?

There are two vectors, A and B, of integer values; they have different numbers of elements, ranging from thousands to millions.

In vector A, values can be repeated and are unordered. In vector B, the values are ordered and do not repeat.

For each value v from vector A, the next largest value within [v; v + 10] must be found in vector B; if no such value exists, the result is 0.

On the CPU, I sort A while remembering the old indices, then iterate over the values of A in a single thread in a for loop, using an additional counter for B. After that, I arrange the found values of B back into the original order of the values of A, and that's it, but I would like to speed this up.


The Julia code:

# for example
A = Int32[2, 3, 3, 15, 4, 4, 8, 32, 9, 9, 9, 13, 14]
B = Int32[1, 4, 5, 6, 7, 9, 10, 15, 23, 54]

limit = 10

sortA = sort(A)
inds = invperm(sortperm(A))  # maps sorted positions back to the original order

sortR = zeros(Int32, length(A))

b = 1
for i = 1 : length(sortA)
    global b  # needed when this loop runs at the top level of a script
    while B[b] < sortA[i]
        b += 1
        if b == length(B)
            break
        end
    end
    # the lower bound check guards against B ending before sortA[i]
    if sortA[i] <= B[b] <= (sortA[i] + limit)
        sortR[i] = B[b]
    end
end
R = sortR[inds]

# result:
R == [4, 4, 4, 15, 4, 4, 9, 0, 9, 9, 9, 15, 15]  # -> true

1 Answer
This answer was initially a simplification of Robert Crovella's solution, which he has since deleted.

Thrust is a C++ library shipped with the CUDA Toolkit that provides parallel algorithms which can run on Nvidia GPUs, but also just on multiple CPU cores via its OpenMP and TBB backends. AMD has its own rocThrust, but I don't know whether it provides all the features used in the following code.

The most straightforward solution (that I can think of) using Thrust is the following:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <iostream>
#include <iterator> // for std::ostream_iterator

int main(){

  int A[] = {2, 3, 3, 15, 4, 4, 8, 32, 9, 9, 9, 13, 14};
  int B[] = {1, 4, 5, 6, 7, 9, 10, 15, 23, 54};
  const int size_A = sizeof(A) / sizeof(A[0]);
  const int size_B = sizeof(B) / sizeof(B[0]);
  thrust::host_vector<int> hA(A, A+size_A);
  thrust::host_vector<int> hB(B, B+size_B);

  std::cout << "A = [";
  thrust::copy_n(hA.begin(), size_A, std::ostream_iterator<int>(std::cout, ","));
  std::cout << "]\n"
               "B = [";
  thrust::copy_n(hB.begin(), size_B, std::ostream_iterator<int>(std::cout, ","));
  std::cout << "]\n" << std::flush;

  // data transfer to GPU through constructor
  thrust::device_vector<int> dA(hA);
  thrust::device_vector<int> dB(hB);
  thrust::device_vector<int> dR(hA.size());

  int limit = 10;
  // --------------- actual computation ---------------------
  thrust::lower_bound(dB.cbegin(), dB.cend(), 
                      dA.cbegin(), dA.cend(), 
                      dR.begin());

  thrust::transform(dR.cbegin(), dR.cend(),
                    dA.cbegin(),
                    dR.begin(),
                    [limit, size_B, dB = dB.data()] __device__ (int r, int a) {
                      // guard against r == size_B when a is larger than all of B
                      return (r < size_B && dB[r] <= a + limit) ? dB[r] : 0;
                    });
  // --------------------------------------------------------
  // transfer the result back to the CPU
  thrust::host_vector<int> hR(dR);

  std::cout << "R = [";
  thrust::copy_n(hR.begin(), size_A, std::ostream_iterator<int>(std::cout, ","));
  std::cout << "]\n";
}

compiled via nvcc -std=c++17 -extended-lambda.

thrust::lower_bound is a batched version of std::lower_bound (binary search), i.e. for every value in A it finds the smallest index into B at which that value could be inserted without violating the ordering.

This code does not sort A at all as this doesn't seem to be a worthwhile optimization when generating R in parallel. As is often the case, this parallel implementation might be less efficient (i.e. it does more work) than an ideal sequential implementation, but this is ok as long as it still gives a speedup.

While this implementation needs to search the whole range of B for each value in A instead of a sub-range, the binary search should somewhat make up for that. As sorting is generally expensive I could even imagine that a sequential version of this solution might be faster than OP's sequential algorithm (might depend on the distribution of numbers in A and B).

Sorting, Permutations and Gather/Scatter

If a sorted version of A is needed, this can also be done in parallel:

#include <thrust/sort.h>
// ...
thrust::sort(dA.begin(), dA.end()); // Caution: there is only in-place sort
// ...

If you also need the indices, you will want to zip them onto A (due to the lexicographical comparison of tuples, they will not affect the sorting in a meaningful way):

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
// ...
// fill with 0, 1, ..., size_A - 1 in constructor
thrust::device_vector<int> inds(thrust::make_counting_iterator(0),
                                thrust::make_counting_iterator(0) + size_A);
auto A_and_inds = thrust::make_zip_iterator(
    thrust::make_tuple(dA.begin(), inds.begin()));
thrust::sort(A_and_inds, A_and_inds + size_A);
// ...

This will give the inverse of the permutation OP used in the Julia code, but instead of doing another expensive sort to invert it, one can still use it for the same reordering by scattering instead of gathering, i.e.

R[inds] = sortR  # scatter assignment; this is valid Julia

There are thrust::scatter and thrust::gather for these operations if needed.

Asynchronous Launch and CUDA Streams

The performance of the above code can be increased by launching work asynchronously into CUDA streams. This is achieved by passing the execution policy

#include <thrust/system/cuda/execution_policy.h>
// ...
thrust::cuda::par_nosync.on(stream_handle)

as first argument to the Thrust algorithms, e.g.

thrust::lower_bound(thrust::cuda::par_nosync.on(stream_handle),
                    dB.cbegin(), ...);

You might need to manually synchronize afterwards. This also makes the code less portable, i.e. you can't use the CPU backends anymore.

Kernel Fusion

Ideally we would want to fuse the transform of the output of the binary search and the binary search itself into one kernel to avoid unnecessary reads and writes to global memory and kernel launch overheads.

In Thrust this is normally done via thrust::transform_output_iterator, but in this case that is not possible as there is no good way to do a binary transform in that iterator with A as second input.

Another option for achieving fusion is to use the sequential version of thrust::lower_bound inside of a CUDA kernel:

__global__ void find_limited(thrust::device_ptr<const int> A, int size_A,
                             thrust::device_ptr<const int> B, int size_B,
                             thrust::device_ptr<int> R, int limit) {
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = tid; i < size_A; i += blockDim.x * gridDim.x) {
    int j;
    thrust::lower_bound(thrust::seq, 
                        B, B + size_B,
                        A + i, A + i + 1, 
                        &j);
    // guard against j == size_B when A[i] is larger than all of B
    R[i] = (j < size_B && B[j] <= A[i] + limit) ? B[j] : 0;
  }
}

// ...
// launch configuration might not be optimal
// => might need to tune the block size 
//    and the amount of iterations per thread through a smaller grid size
// => try block sizes 512, 1024
//    and grid sizes size_grid/2, size_grid/4
const int size_block = 256;
const int size_grid = (size_A + size_block - 1) / size_block;
// the stream is the 4th launch parameter; the 3rd is dynamic shared memory
find_limited<<<size_grid, size_block, 0, stream_handle>>>(
            dA.data(), size_A, 
            dB.data(), size_B, 
            dR.data(), limit);
// ...

No guarantee that this is faster; one should benchmark both versions before using either of them in production.

A similar version could probably also be written without the explicit use of a CUDA kernel by doing the same inside a functor/lambda which is then called via thrust::transform, but this example seemed more instructive (especially as OP asked for CUDA and not Thrust). So I will leave this as an exercise to the reader ;)

Generally, you might not see very good performance for small inputs (1K elements is quite small in the GPU space; 1M is much better, but still not big), as they might not provide enough work to fill up a modern GPU (and it needs to be filled up, as the architecture relies on latency hiding). You also need quite a lot of work to hide the data transfer between host and device if the data isn't already there from previous work and/or will be used there afterwards.

paleonix