Compress "sparse data" with CUDA (CCL: connected component labeling reduction)

Question

I have a 5 million list of 32 bit integers (actually a 2048 x 2560 image) that is 90% zeros. The non-zero cells are labels (e.g. 2049, 8195, 1334300, 34320923, 4320932) that completely not sequential or consecutive in any way (it is the output of our custom connected component labeling CCL algorithm). I am working with a NVIDA Tesla K40, so I would love it if this needs any prefix-scan work, that it uses SHUFFLE, BALLOT or any of the higher CC features.

I don't need a full worked out example, just some advice.

To illustrate, here is one blog that was labeled by our CCL algorithm.

Labeled Blob

Other blobs will have a different unique label (e.g. 13282). But all will be surrounded by zeros, and be ellipsoid in shape. (We optimized our CCL for ellipsoids, that is why we don't use the libraries). But one side effect is that the blob labels are not consecutive numbers. We don't care what order they are numbered, but we want one blob labeled #1, and other labeled #2, and the last one to be be labeled #n, where n is the number of blobs in the image.

What do I mean Labled #1? I mean that all 2242 cells should be replaced with a 1. and all 13282 cells would have a #2, etc.

The maximumb blob number from our CCL is equal to 2048x2560. So we know the size of the array.

Actually, Robert Crovella already gave a great answer for this a day ago. It was not exact, but I now see how to apply the answer. So I don't need any more help. But he was so generous in his time and effort, and asked me to re-write the problem with examples, so I did that.

score 4 · Accepted Answer · edited May 23 '17 at 11:45

One possible approach would be to use a sequence of:

thrust::transform - to convert the input data to all 1 or 0:

0 27 42  0 18 99 94 91  0  -- input data
0  1  1  0  1  1  1  1  0  -- this will be our "mask vector"

thrust::inclusive_scan - to convert the mask vector into a progressive sequence:

0  1  1  0  1  1  1  1  0  -- "mask" vector
0  1  2  2  3  4  5  6  6  -- "sequence" vector

Another thrust::transform to mask off the non-increasing values:

0  1  1  0  1  1  1  1  0  -- "mask" vector
0  1  2  2  3  4  5  6  6  -- "sequence" vector
-------------------------
0  1  2  0  3  4  5  6  0  -- result of "AND" operation

Note that we could combine the first two steps with thrust::transform_inclusive_scan and then perform the third step as a thrust::transform with a slightly different transform functor. This modification allows us to dispense with the creation of the temporary "mask" vector.

Here's a fully worked example showing the "modified" approach using thrust::transform_inclusive_scan:

$ cat t635.cu
#include <iostream>
#include <stdlib.h>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/transform_scan.h>
#include <thrust/generate.h>
#include <thrust/copy.h>


#define DSIZE 20
#define PCT_ZERO 40

struct my_unary_op
{
  __host__ __device__
  int operator()(const int data) const
  {
    return (!data) ?  0:1;}
};

struct my_binary_op
{
  __host__ __device__
  int operator()(const int d1, const int d2) const
  {
    return (!d1) ? 0:d2;}
};

int main(){

// generate DSIZE random 32-bit integers, PCT_ZERO% are zero
  thrust::host_vector<int> h_data(DSIZE);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < DSIZE; i++)
    if ((rand()%100)< PCT_ZERO) h_data[i] = 0;
    else h_data[i] %= 1000;
  thrust::device_vector<int> d_data = h_data;
  thrust::device_vector<int> d_result(DSIZE);
  thrust::transform_inclusive_scan(d_data.begin(), d_data.end(), d_result.begin(), my_unary_op(), thrust::plus<int>());
  thrust::transform(d_data.begin(), d_data.end(), d_result.begin(), d_result.begin(), my_binary_op());
  thrust::copy(d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}

$ nvcc -o t635 t635.cu
$ ./t635
0,886,777,0,793,0,386,0,649,0,0,0,0,59,763,926,540,426,0,736,
0,1,2,0,3,0,4,0,5,0,0,0,0,6,7,8,9,10,0,11,
$

Responding to the update, this new information makes the problem more difficult to solve, in my opinion. Histogramming techniques come to mind, but without any limits on the occupied range of the 32 bit integers (labels) or any limits on the number of times a particular label may be duplicated within the data set, histogramming techniques seem impractical. This leads me to consider sorting the data.

An approach like this should work:

Use thrust::sort to sort the data.
Use thrust::unique to remove the duplicates.
The sorted data with duplicates removed now gives us our ordering for our output set [0,1,2, ...]. Let's call this our "map". We can use a parallel binary-search technique to convert each label in the original data set to it's mapped output value.

This process seems pretty "expensive" to me. I would suggest reconsidering the upstream labelling operation to see if it can be re-designed to produce a data set more amenable to efficient downstream processing.

Anyway here is a fully worked example:

$ cat t635.cu
#include <iostream>
#include <stdlib.h>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/copy.h>


#define DSIZE 20
#define PCT_ZERO 40
#define RNG 10

#define nTPB 256

// sets idx to the index of the first element in a that is
// equal to or larger than key

__device__ void bsearch_range(const int *a, const int key, const unsigned len_a, unsigned *idx){
  unsigned lower = 0;
  unsigned upper = len_a;
  unsigned midpt;
  while (lower < upper){
    midpt = (lower + upper)>>1;
    if (a[midpt] < key) lower = midpt +1;
    else upper = midpt;
    }
  *idx = lower;
  return;
  }

__global__ void find_my_idx(const int *a, const unsigned len_a,  int *my_data, int *my_idx, const unsigned len_data){
  unsigned idx = (blockDim.x * blockIdx.x) + threadIdx.x;
  if (idx < len_data){
    unsigned sp_a;
    int val = my_data[idx];
    bsearch_range(a, val, len_a, &sp_a);
    my_idx[idx] = sp_a;
    }
}


int main(){

// generate DSIZE random 32-bit integers, PCT_ZERO% are zero
  thrust::host_vector<int> h_data(DSIZE);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < DSIZE; i++)
    if ((rand()%100)< PCT_ZERO) h_data[i] = 0;
    else h_data[i] %= RNG;
  thrust::device_vector<int> d_data = h_data;
  thrust::device_vector<int> d_result = d_data;
  thrust::sort(d_result.begin(), d_result.end());
  thrust::device_vector<int> d_unique = d_result;
  int unique_size = thrust::unique(d_unique.begin(), d_unique.end()) - d_unique.begin();
  find_my_idx<<< (DSIZE+nTPB-1)/nTPB , nTPB >>>(thrust::raw_pointer_cast(d_unique.data()), unique_size, thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d_result.data()), DSIZE);

  thrust::copy(d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
$ nvcc t635.cu -o t635
$ ./t635
0,6,7,0,3,0,6,0,9,0,0,0,0,9,3,6,0,6,0,6,
0,2,3,0,1,0,2,0,4,0,0,0,0,4,1,2,0,2,0,2,
$

This is a great answer, and an excellent write up. The fault is mine. I left out a point that was obvious to me about CCL, but which makes this solution not work (at least without some modification). See the update. — Dr.YSG, Mar 02 '15 at 01:07
I could not easily find out: does Thrust use SHUFFLE for reductions and prefix scans? — Dr.YSG, Mar 02 '15 at 01:28
I don't know if thrust uses shuffle operations for reductions and prefix scans. (Thrust is an open-source library, a simple grep would probably answer the question.) Certainly they can't be used on devices of less than 3.0 compute capability. I don't see how that question in any way relates to the problem you've described. Regarding the original problem, with your update, the solution techniques I can think of involve histogramming as one of the steps. Is there any limit to (1.) the occupied range of your 32-bit integers or (2.) the number of duplicates possible in a single bin? — Robert Crovella, Mar 02 '15 at 02:13
I suppose my updated answer still isn't quite right since the output data are not "monotonically increasing". Not sure how that can be accomplished given the requirement of duplicates mapping to the same value. Rather than any additional verbiage, your problem statement might be enhanced with a sample input and desired output data set. — Robert Crovella, Mar 02 '15 at 03:08
Robert, I really did not expect full worked out answers. That would be lazy, All I wanted was some tips. I now see how I can apply your first solution to my problem (but I need the full 3 passes). But you have been so generous with your time, I figured a full re-write of the question would help others. As for me, I think I grok how I can do this with prefix-scan and SHUFL. I don't need thrust. Thank you for your help. — Dr.YSG, Mar 02 '15 at 19:43
@RobertCrovella sorry to necro this thread, I'm curious as to why you suggested the custom binary search method, perhaps I've misunderstood. It seems like a simple `thrust::lower_bound` where the search list is the unique label list from step 2, and the values to search for are the original labels. — nitronoid, Apr 15 '19 at 13:09
Isn't `thrust::lower_bound` a binary search? I don't remember what I was thinking from 4 years ago. Feel free to provide an answer if you have something you'd like to share. — Robert Crovella, Apr 15 '19 at 13:55

score 0 · Answer 2 · answered May 10 '19 at 18:53

My answer is similar to the one @RobertCrovella gave, but I think it's simpler to use thrust::lower_bound instead of the custom binary search. (now that it's pure thrust, the backend can be interchanged)

Copy input data
Sort the copied data
Create a unique list from the sorted data
Find the lower bound of each input, in the unique list

I have included a full example below. Interestingly the process can become quicker by pre-pending the sort step, with another call to thrust::unique. Depending on the input data, this can dramatically reduce the number of elements in the sort, which is the bottleneck here.

#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>

int main()
{
  const int ndata = 20;
  // Generate host input data
  thrust::host_vector<int> h_data(ndata);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < ndata; i++)
  {
    if ((rand() % 100) < 40)
      h_data[i] = 0;
    else
      h_data[i] %= 10;
  }

  // Copy data to the device
  thrust::device_vector<int> d_data = h_data;
  // Make a second copy of the data
  thrust::device_vector<int> d_result = d_data;
  // Sort the data copy
  thrust::sort(d_result.begin(), d_result.end());
  // Allocate an array to store unique values
  thrust::device_vector<int> d_unique = d_result;
  {
    // Compress all duplicates
    const auto end = thrust::unique(d_unique.begin(), d_unique.end());
    // Search for all original labels, in this compressed range, and write their
    // indices back as the result
    thrust::lower_bound(
      d_unique.begin(), end, d_data.begin(), d_data.end(), d_result.begin());
  }

  thrust::copy(
    d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(),
               d_result.end(),
               std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}

Compress "sparse data" with CUDA (CCL: connected component labeling reduction)

2 Answers2