Thrust CUDA find maximum per each group(segment)

Question

My data like

value = [1, 2, 3, 4, 5, 6]
key =   [0, 1, 0, 2, 1, 2]

I need to now maximum(value and index) per each group(key). So the result should be

max = [3, 5, 6]
index = [2, 4, 5]
key = [0, 1, 2]

How can I get it with cuda thrust? I can do sort -> reduce_by_key but it's not really efficient. In my case vector size > 10M and key space ~ 1K(starts from 0 without gaps).

Use thrust::sort_by_key to group like keys together. Then use thrust::reduce_by_key along with a zip_iterator and a counting_iterator (for the index) to find the max in each key along with its index. — Robert Crovella, Aug 12 '16 at 17:49
It took me a minute to understand what you were asking. You should edit the first part to be clearer. — Keith M, Aug 15 '16 at 21:41
If I understand right you can try to use for example map for storing data with key from you're key array and value which contains data about index and value something like this `map>>` (you also can create struct instead pair which has index and value field) than find max in each element of the map by value in parallel on CUDA — segevara, Aug 16 '16 at 05:02
Instead of trying to use thrust to solve your particular problem, why not try writing a custom kernel? A single kernel with atomic operations can get you max values, you can get max value and index using two passes. — Pavan Yalamanchili, Aug 17 '16 at 05:08
@PavanYalamanchili that is also acceptable. Could you provide some code — sh1ng, Aug 17 '16 at 07:25
are there any bounds on the elements of `value` and `key`? any known min and max values? are there any patterns in your dataset that could be exploited? did you benchmark the `reduce_by_key` solution? — m.s., Aug 17 '16 at 07:28
@m.s. I'm going to implement it with sort -> reduce_by_key first and make some measurement. value is just a `float` key(s) from `0` to `N`. Usually N < 10K. Elements distributed randomly. — sh1ng, Aug 17 '16 at 07:51
There are 2 ways. **1.** sort_by_key -> reduce_by_key **2.** HashMatch by using 400 MB `float arr_keys[10000][10000];` and the same array for values, then use `transform` with `permutation_iterator` which use your hash-functor for mapping key to X (`arr[X][Y]`), then iterate Y for each X and compare key - to avoid collisions, and use `atomicCas()` for new collision, or `atomicMin()` for value if key matched. If value is float: https://devtalk.nvidia.com/default/topic/492068/atomicmin-with-float/ Or by using HASH-MAP (CUDPP hash_table): https://github.com/cudpp/cudpp/tree/master/src/cudpp_hash — Alex, Aug 17 '16 at 10:17
@sh1ng why not try writing it yourself first and see what problems you are facing instead of asking for a full solution. — Pavan Yalamanchili, Aug 17 '16 at 13:46

score 4 · Accepted Answer · edited May 23 '17 at 12:33

Since the original question focused on thrust, I didn't have any suggestions other than what I mentioned in the comments,

However, based on further dialog in the comments, I thought I would post an answer that covers both CUDA and thrust.

The thrust method uses a sort_by_key operation to group like keys together, followed by a reduce_by_key operation to find the max + index for each key-group.

The CUDA method uses a custom atomic approach I describe here to find a 32-bit max plus 32-bit index (for each key-group).

The CUDA method is substantially (~10x) faster, for this specific test case. I used a vector size of 10M and a key size of 10K for this test.

My test platform was CUDA 8RC, RHEL 7, and Tesla K20X GPU. K20X is a member of the Kepler generation which has much faster global atomics than previous GPU generations.

Here's a fully worked example, covering both cases, and providing a timing comparison:

$ cat t1234.cu
#include <iostream>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/functional.h>
#include <cstdlib>

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

const size_t ksize = 10000;
const size_t vsize = 10000000;
const int nTPB = 256;

struct my_max_func
{

  template <typename T1, typename T2>
  __host__ __device__
  T1 operator()(const T1 t1, const T2 t2){
    T1 res;
    if (thrust::get<0>(t1) > thrust::get<0>(t2)){
      thrust::get<0>(res) = thrust::get<0>(t1);
      thrust::get<1>(res) = thrust::get<1>(t1);}
    else {
      thrust::get<0>(res) = thrust::get<0>(t2);
      thrust::get<1>(res) = thrust::get<1>(t2);}
    return res;
    }
};

typedef union  {
  float floats[2];                 // floats[0] = maxvalue
  int ints[2];                     // ints[1] = maxindex
  unsigned long long int ulong;    // for atomic update
} my_atomics;


__device__ unsigned long long int my_atomicMax(unsigned long long int* address, float val1, int val2)
{
    my_atomics loc, loctest;
    loc.floats[0] = val1;
    loc.ints[1] = val2;
    loctest.ulong = *address;
    while (loctest.floats[0] <  val1)
      loctest.ulong = atomicCAS(address, loctest.ulong,  loc.ulong);
    return loctest.ulong;
}


__global__ void my_max_idx(const float *data, const int *keys,const int ds, my_atomics *res)
{

    int idx = (blockDim.x * blockIdx.x) + threadIdx.x;
    if (idx < ds)
      my_atomicMax(&(res[keys[idx]].ulong), data[idx],idx);
}


int main(){

  float *h_vals = new float[vsize];
  int   *h_keys = new int[vsize];
  for (int i = 0; i < vsize; i++) {h_vals[i] = rand(); h_keys[i] = rand()%ksize;}
// thrust method
  thrust::device_vector<float> d_vals(h_vals, h_vals+vsize);
  thrust::device_vector<int> d_keys(h_keys, h_keys+vsize);
  thrust::device_vector<int> d_keys_out(ksize);
  thrust::device_vector<float> d_vals_out(ksize);
  thrust::device_vector<int> d_idxs(vsize);
  thrust::device_vector<int> d_idxs_out(ksize);

  thrust::sequence(d_idxs.begin(), d_idxs.end());
  cudaDeviceSynchronize();
  unsigned long long et = dtime_usec(0);

  thrust::sort_by_key(d_keys.begin(), d_keys.end(), thrust::make_zip_iterator(thrust::make_tuple(d_vals.begin(), d_idxs.begin())));
  thrust::reduce_by_key(d_keys.begin(), d_keys.end(), thrust::make_zip_iterator(thrust::make_tuple(d_vals.begin(),d_idxs.begin())), d_keys_out.begin(), thrust::make_zip_iterator(thrust::make_tuple(d_vals_out.begin(), d_idxs_out.begin())), thrust::equal_to<int>(), my_max_func());
  cudaDeviceSynchronize();
  et = dtime_usec(et);
  std::cout << "Thrust time: " << et/(float)USECPSEC << "s" << std::endl;

// cuda method

  float *vals;
  int *keys;
  my_atomics *results;
  cudaMalloc(&keys, vsize*sizeof(int));
  cudaMalloc(&vals, vsize*sizeof(float));
  cudaMalloc(&results, ksize*sizeof(my_atomics));

  cudaMemset(results, 0, ksize*sizeof(my_atomics)); // works because vals are all positive
  cudaMemcpy(keys, h_keys, vsize*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(vals, h_vals, vsize*sizeof(float), cudaMemcpyHostToDevice);
  et = dtime_usec(0);

  my_max_idx<<<(vsize+nTPB-1)/nTPB, nTPB>>>(vals, keys, vsize, results);
  cudaDeviceSynchronize();
  et = dtime_usec(et);
  std::cout << "CUDA time: " << et/(float)USECPSEC << "s" << std::endl;

// verification

  my_atomics *h_results = new my_atomics[ksize];
  cudaMemcpy(h_results, results, ksize*sizeof(my_atomics), cudaMemcpyDeviceToHost);
  for (int i = 0; i < ksize; i++){
    if (h_results[i].floats[0] != d_vals_out[i]) {std::cout << "value mismatch at index: " << i << " thrust: " << d_vals_out[i] << " CUDA: " << h_results[i].floats[0] << std::endl; return -1;}
    if (h_results[i].ints[1] != d_idxs_out[i]) {std::cout << "index mismatch at index: " << i << " thrust: " << d_idxs_out[i] << " CUDA: " << h_results[i].ints[1] << std::endl; return -1;}
    }

  std::cout << "Success!" << std::endl;
  return 0;
}

$ nvcc -arch=sm_35 -o t1234 t1234.cu
$ ./t1234
Thrust time: 0.026593s
CUDA time: 0.002451s
Success!
$

Very fast solution for a limited range of integer values for keys. But the creator of a question in the comment added: "value is just a **float key(s)** from 0 to N." From the experience of advanced systems, in DBMS (MSSQL/Oracle...) for the grouping all types of values and keys usually use only two approaches: Ordered matching (sort by key + sorted group by key), and Hash Matching (hash table with min/max/sum... operations). Both can be implemented on CUDA. — Alex, Aug 17 '16 at 21:54
I took that to mean "value is just a float" (full stop) "key(s) range from 0 to N". "value is just a float key" doesn't make much sense to me since **key** and **value** are separate concepts. My proposed solution works for `int` keys from 0 to N, which seems to be exactly what OP was asking for. — Robert Crovella, Aug 17 '16 at 22:32

Thrust CUDA find maximum per each group(segment)

1 Answers1