
I have a large device array inputValues of int64_t type. Every 32 elements of this array are sorted in ascending order. I have an unsorted search array removeValues.

My intention is to look for all the elements of removeValues inside inputValues and mark them as -1. What is the most efficient method to achieve this? I am using a compute capability 3.5 CUDA device, if that helps.

I am not looking for a higher-level solution, i.e. I do not want to use thrust or cub; I want to write this using CUDA kernels.

My initial approach was to load every 32 values into shared memory in a thread block. Every thread also loads a single value from removeValues and does an independent binary search on the shared memory array. If found, the value is marked accordingly using an if condition.
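
A rough, untested sketch of what I have in mind (one block of 32 threads per sorted group, with each thread striding over removeValues; the kernel and variable names are just for illustration):

#include <cstdint>

// one block per sorted 32-element group of inputValues, launched with 32 threads
__global__ void markFound(const int64_t * __restrict__ inputValues,
                          int64_t *removeValues, int numRemove){
  __shared__ int64_t group[32];
  // cache this block's sorted group in shared memory
  group[threadIdx.x] = inputValues[blockIdx.x*32+threadIdx.x];
  __syncthreads();
  // each thread binary-searches one removeValues entry at a time
  for (int i = threadIdx.x; i < numRemove; i += blockDim.x){
    int64_t v = removeValues[i];
    int lo = 0, hi = 31;
    while (lo <= hi){
      int mid = (lo+hi)>>1;
      int64_t m = group[mid];
      if (m == v){ removeValues[i] = -1; break;}
      lo = (m < v) ? mid+1 : lo;   // ternary form, hoping to limit divergence
      hi = (m < v) ? hi : mid-1;}
  }
}
// launched as markFound<<<numGroups, 32>>>(d_inputValues, d_removeValues, numRemove);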

Wouldn't this approach involve a lot of bank conflicts and branch divergence? Do you think that branch divergence can be addressed by using ternary operators while implementing the binary search? Even if that is solved, how can bank conflict be eliminated? Since the size of sorted arrays is 32, would it be possible to implement a binary search using shuffle instructions? Would that help?

EDIT: I have added an example to show what I intend to achieve.

Let's say that inputValues is a vector where every 32 elements are sorted:
[2, 4, 6, ... , 64], [95, 97, ... , 157], [1, 3, ... , 63], [...]

The typical size for this array can range from 32*2 to 32*32. The values could range from 0 to INT64_MAX.

An example of removeValues would be:
[7, 75, 95, 106]

The typical size for this array could range from 1 to 1024.

After the operation removeValues would be: [-1, 75, -1, 106]

The values in inputValues remain unchanged.

aatish
  • "I do not want to use thrust" any reason for that? – m.s. Jan 05 '16 at 21:05
  • There are quite a few reasons for that, performance being one of them. This is a substep of a larger piece of code. The array `inputValues` is generated in a very nontrivial way. You could say that the sorted 32-element chunks of `inputValues` were not contiguous to begin with. I do not wish to copy them to a contiguous chunk of memory but rather keep them in registers or shared memory at best. Gathering from separated global memory to contiguous memory does not seem like a good idea to me. – aatish Jan 05 '16 at 21:13
  • @m.s. Which thrust function do you think suits my needs in this situation? – aatish Jan 05 '16 at 21:19
  • please add a small numerical example of what you want to achieve, maybe there is a thrust way to do it; how large is `inputValues`? how large is `removeValues`? – m.s. Jan 05 '16 at 21:39
  • I think an example illustrating the desired functionality would be helpful. I have read the question twice and am still not clear as to what exactly the code in question is supposed to do. Why not post the code you already have, so it can be used as a baseline implementation? – njuffa Jan 05 '16 at 21:51
  • Another thing: is the data range of those arrays limited somehow? – m.s. Jan 05 '16 at 22:18
  • bank conflicts should not be an issue, at least as far as the described proposal goes. On a cc3.5, put the shared memory in 64-bit-bank mode (a one-line example of that setting is sketched just after these comments). A given warp will only ever be looking at 32 contiguous locations, therefore no bank conflicts are possible. Threads in the warp reading from the same location would be serviced by the broadcast mechanism. Bank conflicts would only be possible for a data array that had multiple data items in the same bank. 32 contiguous items total (from the view of a particular warp) could not possibly have more than one item per bank. – Robert Crovella Jan 05 '16 at 22:30
  • I don't think warp divergence would be catastrophic. A [naive binary search](http://stackoverflow.com/questions/21658518/search-an-ordered-array-in-a-cuda-kernel) should have at most 2-way divergence, and the compiler might mitigate that to some degree with predication. A warp-shuffle driven binary search on a 32-element array should be possible that involves no divergence (I think), but warp-shuffle natively handles only 32-bit quantities, so two shuffle operations would be required per 64-bit op. I would also be interested in knowing how large `inputValues` and `removeValues` typically are – Robert Crovella Jan 05 '16 at 23:14
  • It's also not clear if you intended to mark the `removeValues` or the `inputValues` with `-1`. And when you mark, were you going to use a separate marker array, or would you replace the given value with `-1`? If the latter, that would also beg the question about the ranges involved (only non-negative values, perhaps?) – Robert Crovella Jan 05 '16 at 23:16
  • @RobertCrovella I have added an example. The `inputValues` remain unchanged. Elements in `removeValues` are replaced with -1 if these elements exist in `inputValues`. And yes, all elements involved will be positive. If the signedness of the datatypes is a problem, the values can be marked with UINT64_MAX as well. My only requirement is to mark existing values with a conspicuous number so as to identify 'empty' entries sometime later in the code. – aatish Jan 05 '16 at 23:35
  • You did not answer if the range of the values is limited; your example makes it look like the range is very small. Is this true for your real data as well? – m.s. Jan 06 '16 at 02:04
  • @m.s. I have clearly mentioned in the edit that 'The values could range from 0 to INT64_MAX.' – aatish Jan 06 '16 at 07:49
  • sorry I must have overlooked that – m.s. Jan 06 '16 at 10:39
  • Your data size is very small (two arrays of at most 1024 elements). As such, my approach would be to find the simplest possible code to solve the problem, rather than writing a tuned algorithm. So why not use Thrust? Then it's as simple as sort(inputValues), then binary_search(inputValues, removeValues, output)... Since the data size is so small, this will solve it fast enough in a few lines of code and you can move on. This approach is portable and robust. – harrism Jan 07 '16 at 06:12
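
For reference, the 8-byte (64-bit) shared memory bank mode mentioned in the comments comes down to a single host-side runtime call on cc3.x devices (sketch only, error checking omitted):

// host side, before launching any kernel that uses 64-bit shared memory elements
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);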

1 Answer


I would concur with the answer (now deleted) and comment by @harrism. Since I put some effort into the non-thrust approach, I'll present my findings.

I tried to naively implement a binary search at the warp-level using __shfl(), and then repeat that binary search across the data set, passing the data set through each 32-element group.

It's embarrassing, but my code is around 20x slower than thrust (in fact it may be worse than that if you do careful timing with nvprof).

I made the data sizes a little larger than what was proposed in the question, because the data sizes in the question are so small that the timing is lost in the noise.

Here's a fully worked example of 2 approaches:

  1. What is approximately outlined in the question, i.e. create a binary search using warp shuffle that can search up to 32 elements against a 32-element ordered array. Repeat this process for as many 32-element ordered arrays as there are, passing the entire data set through each ordered array (hopefully you can start to see some of the inefficiency now.)

  2. Use thrust, essentially the same as what is outlined by @harrism, i.e. sort the grouped data set, and then run a vectorized thrust::binary_search on that.

Here's the example:

$ cat t1030.cu
#include <stdio.h>
#include <assert.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>

typedef long mytype;

const int gsize = 32;
const int nGRP = 512;
const int dsize = nGRP*gsize;  // total number of data elements

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__device__ T my_shfl32(T val, unsigned lane){
  return __shfl(val, lane);
}

// build a 64-bit shuffle out of two 32-bit __shfl() operations
// (cc3.x has no native 64-bit warp shuffle)
template <typename T>
__device__ T my_shfl64(T val, unsigned lane){
  T retval = val;
  int2 t1 = *(reinterpret_cast<int2 *>(&retval));
  t1.x = __shfl(t1.x, lane);
  t1.y = __shfl(t1.y, lane);
  retval = *(reinterpret_cast<T *>(&t1));
  return retval;
}

// warp-wide binary search: each lane of the warp holds one element of a
// sorted 32-element group (grp_val); returns true if my_val is present
template <typename T>
__device__ bool bsearch_shfl(T grp_val, T my_val){
  int src_lane = gsize>>1;
  bool return_val = false;
  T test_val;
  int shift = gsize>>2;
  for (int i = 0; i <= gsize>>3; i++){
    if (sizeof(T)==4){
      test_val = my_shfl32(grp_val, src_lane);}
    else if (sizeof(T)==8){
      test_val = my_shfl64(grp_val, src_lane);}
    else assert(0);
    if (test_val == my_val) return_val = true;
    src_lane += (((test_val<my_val)*2)-1)*shift;
    shift>>=1;
    assert ((src_lane < gsize)&&(src_lane > 0));}
  if (sizeof(T)==4){
    test_val = my_shfl32(grp_val, 0);}
  else if (sizeof(T)==8){
    test_val = my_shfl64(grp_val, 0);}
  else assert(0);
  if (test_val == my_val) return_val = true;
  return return_val;
}

// each block is a single warp holding one sorted 32-element group in registers;
// the warp passes the entire data array through its group
template <typename T>
__global__ void bsearch_grp(const T * __restrict__ search_grps, T *data){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int tid = threadIdx.x;
  if (idx < gsize*nGRP){
    T grp_val = search_grps[idx];
    while (tid < dsize){
      T my_val = data[tid];
      if (bsearch_shfl(grp_val, my_val)) data[tid] = -1;
      tid += blockDim.x;}
  }
}


int main(){

  // data setup
  assert(gsize == 32);  //mandatory (warp size)
  assert((dsize % 32)==0);  //needed to preserve shfl capability
  thrust::host_vector<mytype> grps(gsize*nGRP);
  thrust::host_vector<mytype> data(dsize);
  thrust::host_vector<mytype> result(dsize);
  for (int i = 0; i < gsize*nGRP; i++) grps[i] = i;
  for (int i = 0; i < dsize; i++) data[i] = i;
  // method 1: individual shfl-based binary searches on each group
  mytype *d_grps, *d_data;
  cudaMalloc(&d_grps, gsize*nGRP*sizeof(mytype));
  cudaMalloc(&d_data, dsize*sizeof(mytype));
  cudaMemcpy(d_grps, &(grps[0]), gsize*nGRP*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_data, &(data[0]), dsize*sizeof(mytype), cudaMemcpyHostToDevice);
  unsigned long long my_time = dtime_usec(0);
  bsearch_grp<<<nGRP, gsize>>>(d_grps, d_data);
  cudaDeviceSynchronize();
  my_time = dtime_usec(my_time);
  cudaMemcpy(&(result[0]), d_data, dsize*sizeof(mytype), cudaMemcpyDeviceToHost);
  for (int i = 0; i < dsize; i++) if (result[i] != -1) {printf("method 1 mismatch at %d, was %d, should be -1\n", i, (int)(result[i])); return 1;}
  printf("method 1 time: %fs\n", my_time/(float)USECPSEC);
  // method 2: thrust sort, followed by thrust binary search
  thrust::device_vector<mytype> t_grps = grps;
  thrust::device_vector<mytype> t_data = data;
  thrust::device_vector<bool> t_rslt(t_data.size());
  my_time = dtime_usec(0);
  thrust::sort(t_grps.begin(), t_grps.end());
  thrust::binary_search(t_grps.begin(), t_grps.end(), t_data.begin(), t_data.end(), t_rslt.begin());
  cudaDeviceSynchronize();
  my_time = dtime_usec(my_time);
  thrust::host_vector<bool> rslt = t_rslt;
  for (int i = 0; i < dsize; i++) if (rslt[i] != true) {printf("method 2 mismatch at %d, was %d, should be 1\n", i, (int)(rslt[i])); return 1;}
  printf("method 2 time: %fs\n", my_time/(float)USECPSEC);

  // method 3:  multiple thrust merges, followed by thrust binary search



  return 0;
}

$ nvcc -O3 -arch=sm_35 t1030.cu -o t1030
$ ./t1030
method 1 time: 0.009075s
method 2 time: 0.000516s
$

I ran this on Linux, CUDA 7.5, GT 640 GPU. Obviously the performance will be different on different GPUs, but I'd be surprised if any GPU significantly closed the gap.

In short, you'd be well advised to use a well-tuned library like thrust or cub. If you don't like the monolithic nature of thrust, you could try cub. I don't know if cub has a binary search, but a single binary search against the whole sorted data set is not a difficult thing to write, and it's the smaller part of the time involved (for method 2 -- identifiable using nvprof or additional timing code).
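
For completeness, here is a minimal sketch of what that single binary search could look like as a hand-written kernel, assuming the grouped data has already been sorted into one ascending array (untested, names are illustrative):

#include <cstdint>

// one thread per removeValues element; classic binary search over the fully
// sorted array of length n; matching entries are marked with -1
__global__ void mark_present(const int64_t * __restrict__ sorted, int n,
                             int64_t *removeValues, int numRemove){
  int i = threadIdx.x+blockDim.x*blockIdx.x;
  if (i >= numRemove) return;
  int64_t v = removeValues[i];
  int lo = 0, hi = n-1;
  while (lo <= hi){
    int mid = (lo+hi)>>1;
    int64_t m = sorted[mid];
    if (m == v){ removeValues[i] = -1; return;}
    if (m < v) lo = mid+1; else hi = mid-1;}
}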

Since your 32-element grouped ranges are already sorted, I also pondered the idea of using multiple thrust::merge operations rather than a single sort. I'm not sure which would be faster, but since the thrust method is already so much faster than the 32-element shuffle search method, I think thrust (or cub) is the obvious choice.
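
If you want to experiment with that, here is one way the merge-based variant might be structured (untested sketch, reusing the names from the listing above and assuming t_grps still holds the original grouped order; a tree of pairwise merges would need fewer passes than this serial fold):

  // method 3 sketch: fold the already-sorted gsize-element groups together
  // with thrust::merge instead of sorting, then binary-search as in method 2
  // (requires #include <thrust/merge.h> in addition to the headers above)
  thrust::device_vector<mytype> merged(t_grps.begin(), t_grps.begin()+gsize);
  thrust::device_vector<mytype> next;
  for (int g = 1; g < nGRP; g++){
    next.resize(merged.size()+gsize);
    thrust::merge(merged.begin(), merged.end(),
                  t_grps.begin()+g*gsize, t_grps.begin()+(g+1)*gsize,
                  next.begin());
    merged.swap(next);}
  thrust::binary_search(merged.begin(), merged.end(), t_data.begin(), t_data.end(), t_rslt.begin());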

Robert Crovella