I would like to know how to generate a Cartesian product on a GPU using CUDA.
Simple case: We have two lists:

 A = {0.0, 0.1, 0.2}   B = {0.0, 0.1, 0.2}
 A x B = C = { {0.0, 0.0}, {0.0, 0.1}, {0.0, 0.2}, {0.1, 0.0}, {0.1, 0.1} ...}

How can I generate C (a list of lists) on the GPU? How can this be done for N lists with M values each?

The terminology that I am using might be incorrect, so I will try to explain what I mean.
I am essentially trying to generate a truth table. A binary truth table would look like:

A     B
0     0
0     1
1     0
1     1

where A has two values {0, 1} and B has two values {0, 1}. In my case A and B have more than two values, for starters 31 values (0 - 30). For every value in set A there are 31 values in set B; I need to enumerate all the combinations and store them in memory.

Beyond that, I need to extend the algorithm to N lists instead of just 2 lists (A and B).

fahad
  • Have you tried something? – sgarizvi Apr 24 '13 at 15:47
  • Generating a Cartesian product on the CPU can simply be done via iterators or vectors of vectors. But I can't seem to come up with any idea of how this can be done on the GPU. I thought about making a counter: adding one to the least significant array index and carrying the overflow to the left. But that too is not realizable (at least for me on the GPU). – fahad Apr 24 '13 at 15:52
  • The [cartesian product](http://en.wikipedia.org/wiki/Cartesian_product) of 2 sets that I'm familiar with produces a set of *ordered pairs*. What rule do you use to create a set of triples from 2 sets? Is this a homework question? – Robert Crovella Apr 24 '13 at 16:00
  • A useful heuristic to start thinking about how to solve problems in parallel on the GPU is to come up with a *thread strategy*. What will each thread do? In algorithms that produce a large amount of output data points (such as this one), a common thread strategy is to have each thread be responsible for producing one output point (let's say an ordered pair for the case where I have 2 sets). If set A is of size a, and set B is of size b, then I know that I need a*b threads. A 2D array of threads immediately comes to mind, where each thread will choose one element from each of the 2 input sets. – Robert Crovella Apr 24 '13 at 16:28
  • Hi Robert, thanks for the reply; it has at least given me a starting point for thinking about this in parallel. No, this is not a homework assignment. You are right about the Cartesian product. I have fixed my mistake in the question. In my actual problem I have at least 3 lists, so I got carried away while writing down this question. If I see how it works for 2 lists, perhaps I can work out how to extend it. – fahad Apr 24 '13 at 17:37
  • But now that I have been working on this for a few hours, I see that each thread will not simply take one value from each list. Each thread should take one value from list A and then one value from list B. The next thread should take the same value from list A and the next value from list B. This procedure must exhaust all values of B before a new thread can take the next value from list A and repeat the procedure. Any pointers on that? – fahad Apr 24 '13 at 19:52

1 Answer

I don't claim this is efficient; just functional:

#include <thrust/device_vector.h>
#include <thrust/pair.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>

__global__ void cartesian_product(const int *a, size_t a_size,
                                  const int *b, size_t b_size,
                                  thrust::pair<int,int> *c)
{
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if(idx < a_size * b_size)
  {
    // b varies fastest in the output, so divide and take the remainder by b_size
    unsigned int a_idx = idx / b_size;
    unsigned int b_idx = idx % b_size;

    c[idx] = thrust::make_pair(a[a_idx], b[b_idx]);
  }
}

int main()
{
  thrust::device_vector<int> a(3);
  a[0] = 0; a[1] = 1; a[2] = 2;

  thrust::device_vector<int> b(3);
  b[0] = 0; b[1] = 1; b[2] = 2;

  thrust::device_vector<thrust::pair<int,int> > c(a.size() * b.size());

  unsigned int block_size = 256;
  unsigned int num_blocks = (c.size() + (block_size - 1)) / block_size;

  cartesian_product<<<num_blocks, block_size>>>(thrust::raw_pointer_cast(a.data()), a.size(),
                                                thrust::raw_pointer_cast(b.data()), b.size(),
                                                thrust::raw_pointer_cast(c.data()));

  std::cout << "a: { ";
  thrust::copy(a.begin(), a.end(), std::ostream_iterator<int>(std::cout, ", "));
  std::cout << "}" << std::endl;

  std::cout << "b: { ";
  thrust::copy(b.begin(), b.end(), std::ostream_iterator<int>(std::cout, ", "));
  std::cout << "}" << std::endl;

  std::cout << "c: { ";
  for(unsigned int i = 0; i < c.size(); ++i)
  {
    thrust::pair<int,int> x = c[i];
    std::cout << "(" << x.first << ", " << x.second << "), ";
  }
  std::cout << "}" << std::endl;

  return 0;
}

The program's output:

$ nvcc cartesian_product.cu -run
a: { 0, 1, 2, }
b: { 0, 1, 2, }
c: { (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2), }
Jared Hoberock
  • Thanks Jared. This is indeed the cleanest approach that I could have wished for. Thanks again. – fahad Apr 24 '13 at 20:25
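
The question also asks about N lists rather than two. One possible way to extend the one-thread-per-output idea, offered here only as a sketch and not as part of the answer above, is to treat the flat output index as a mixed-radix number and peel off one "digit" per list. The names `HOST_DEVICE`, `decode_index` and `cartesian_indices` are hypothetical and chosen for illustration. The code is written as plain C++ so the index arithmetic can be checked on the host; the assumption is that `decode_index` could be called per thread inside a kernel when compiled with nvcc.

```cpp
#include <cstddef>
#include <vector>

// Expands to CUDA qualifiers under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: decode a flat output index into one index per list.
// sizes[k] is the length of list k. The last list varies fastest, which
// matches the ordering of the two-list output shown above.
HOST_DEVICE inline void decode_index(std::size_t idx,
                                     const std::size_t *sizes,
                                     std::size_t num_lists,
                                     std::size_t *out)
{
  for (std::size_t k = num_lists; k-- > 0;)
  {
    out[k] = idx % sizes[k];  // position within list k
    idx /= sizes[k];          // pass the remainder on to the lists to the left
  }
}

// Host-side reference: build every tuple of indices in output order.
// In a kernel, each thread would instead call decode_index on its own idx.
inline std::vector<std::vector<std::size_t>>
cartesian_indices(const std::vector<std::size_t> &sizes)
{
  std::size_t total = 1;
  for (std::size_t s : sizes) total *= s;

  std::vector<std::vector<std::size_t>> tuples(
      total, std::vector<std::size_t>(sizes.size()));
  for (std::size_t idx = 0; idx < total; ++idx)
    decode_index(idx, sizes.data(), sizes.size(), tuples[idx].data());
  return tuples;
}
```

Calling `cartesian_indices({3, 3})` yields nine index pairs in the same order as the output above; the values themselves would then be looked up as `a[out[0]]`, `b[out[1]]`, and so on. Note that the output grows as the product of the list sizes: three lists of 31 values each already give 31^3 = 29,791 tuples, so memory use needs watching as N increases.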