Generate Cartesian Product using more than two lists on GPU

Question

I would like to know how to generate a Cartesian product of more than two lists using CUDA.

How do I make this code work with three or more lists?

It works with two lists but not with three, I tried /, % without success.

Its basic.

#include <thrust/device_vector.h>
    #include <thrust/pair.h>
    #include <thrust/copy.h>
    #include <iterator>
    
    __global__ void cartesian_product(const int *a, size_t a_size,
                                      const int *b, size_t b_size,
                                      const int *c, size_t c_size)
    {
      unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
      if(idx < a_size * b_size * c_size) 
      {
        unsigned int a_idx = idx / a_size;
        unsigned int b_idx = idx % a_size;
        
        // ? 
        unsigned int c_idx = idx % a_size;
    
    
        
        printf("a[a_idx] and b[b_idx] and c[c_idx] are: %d %d %d\n\n",a[a_idx], b[b_idx], c[c_idx]);
        //1 3 5 , 1 3 6 , 1 4 5 , 1 4 6 , 2 3 5 , 2 3 6 , 2 4 5 , 2 4 6  
        //0 0 0 , 0 0 1 , 0 1 0 , 0 1 1 , 1 0 0 , 1 0 1 , 1 1 0 , 1 1 1
      }
    }
    
    int main()
    {
      
      
      // host_vector is stored in host memory while device_vector livesin GPU device memory.
      // a has storage for 2 integers
      thrust::device_vector<int> a(2);
      
      // initialize individual elements
      a[0] = 1; 
      a[1] = 2; 
     
    
      // b has storage for 2 integers
      thrust::device_vector<int> b(2);
      
      // initialize individual elements
      b[0] = 3; 
      b[1] = 4; 
     
    
       // d has storage for 2 integers
      thrust::device_vector<int> c(2);
      
      // initialize individual elements
      c[0] = 5; 
      c[1] = 6;
      
       
      unsigned int block_size = 256;
      unsigned int num_blocks = (8 + (block_size - 1)) / block_size;
    
    
      // raw_pointer_cast creates a "raw" pointer from a pointer-like type, simply returning the wrapped pointer, should it exist.
      cartesian_product<<<num_blocks, block_size>>>(thrust::raw_pointer_cast(a.data()), a.size(),
                                                    thrust::raw_pointer_cast(b.data()), b.size(),
                                                    thrust::raw_pointer_cast(c.data()), c.size());

      
      
      return 0;
    }

How do I get the right c_idx in the kernel and subsequent arrays if I want more than three lists?

paleonix · Accepted Answer · 2021-01-18T14:58:38.820

2

It seems to me that you want "lexic indexing":

idx == (a_idx * b_size + b_idx) * c_size + c_idx

So you get your indices like this:

c_idx = idx % c_size;
b_idx = (idx / c_size) % b_size;
a_idx = (idx / c_size) / b_size;

This is easily generalizable to more dimensions. E.g. in four dimensions you have

idx == ((a_idx * b_size + b_idx) * c_size + c_idx) * d_size + d_idx

Then:

d_idx = idx % d_size;
c_idx = (idx / d_size) % c_size;
b_idx = ((idx / d_size) / c_size) % b_size;
a_idx = ((idx / d_size) / c_size) / b_size;

In C/C++ programming one likes to use this to calculate indices into an one-dimensional dynamic array representing a multidimensional dataset. In CUDA you normally don't need it as much, as CUDA gives you up to three-dimensional threadIdx/blockIdx/etc.. So for the Cartesian product of three arrays, you won't need this technique, but could just use the intrinsic CUDA features. Even in more than three the most performant solution will get two indices from two of the three dimensions of the kernel and use lexic indexing on the third one:

__global__ void cartesian_product_5d(const int *a, size_t a_size,
                                  const int *b, size_t b_size,
                                  const int *c, size_t c_size,
                                  const int *d, size_t d_size,
                                  const int *e, size_t e_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int d_idx = blockIdx.y * blockDim.y + threadIdx.y;
    int e_idx = blockIdx.z * blockDim.z + threadIdx.z;
    /* idx == (c_idx * b_size + b_idx) * a_size + a_idx */
    int a_idx = idx % a_size;
    int b_idx = (idx / a_size) % b_size;
    int c_idx = (idx / a_size) / b_size;

    /* ... */
}
 
int main()
{
    /* ... */
    dim3 threadsPerBlock(8, 8, 8);
    dim3 numBlocks((a_size + b_size + c_size + threadsPerBlock.x - 1) /
                   threadsPerBlock.x,
                   (d_size + threadsPerBlock.y - 1) / threadsPerBlock.y,
                   (e_size + threadsPerBlock.z - 1) / threadsPerBlock.z);
    cartesian_product_5d<<<numBlocks, threadsPerBlock>>>(/* ... */);
    /* ... */
}

edited Jan 18 '21 at 14:58

answered Jan 18 '21 at 13:08

paleonix

2,293
1
13
29

@markpilsur I added some CUDA-specific optimization. Please consider accepting the answer. – paleonix Jan 18 '21 at 13:28
could you please include a cuda for more than three indexes? And how do I accept the answer? – mark pilsur Jan 18 '21 at 13:36
There should be a grey checkmark you can click: https://stackoverflow.com/help/someone-answers If you really need it, I can, but it is just a combination of the code that is already there: In the main you use `c_size + d_size` when defining the amount of threads. Then use lexic indexing to get `c_idx` and `d_idx` from what I have called `c_idx` in the CUDA code above. You just have to be cautious with the curse of dimensionality. When having many dimensions you might run into the problem of having more elements of the product than allowed threads. You can still use a loop then. – paleonix Jan 18 '21 at 14:22
Actually using the three cuda indexes is out of the question since I need 39 lists, so I will use your first example – mark pilsur Jan 18 '21 at 14:44
How long are these lists? – paleonix Jan 18 '21 at 14:57
Each list contain 2 records. e.g. a[0] = 1, a[1] = 2. – mark pilsur Jan 18 '21 at 15:05
I updated the example for 5D, but I doubt that this will do well on a GPU for d=39 and array sizes bigger than two or maybe three without a lot more thought put in. Division and Modulo operations are fairly expensive compared to addition and multiplication. So you should just get as many dimensions via the above methods as you need to have enough threads for the GPU to become effective (millions to trillions maybe). The rest should be done via loops inside the kernel which wont need these divisions. You probably can use shared memory to optimize reading from global memory as well. – paleonix Jan 18 '21 at 15:06
For 2 it could be possible without for loops (meaning that you are allowed to create that many threads), but I don't think that it will be very performant due to the divisions. – paleonix Jan 18 '21 at 15:08
Although, I might be wrong. If the compiler knows that the size variables are all two he can use bit shifts instead of division which is quite effective. So you have to make these variables available at compilation time. – paleonix Jan 18 '21 at 15:12
yes, this is for 4 lists: unsigned int e_idx = idx & 1; unsigned int d_idx = (idx >> __ffs(2)-1) & 1; unsigned int c_idx = ((idx >> __ffs(2)-1) >> __ffs(2)-1) & 1; unsigned int b_idx = (((idx >> __ffs(2)-1) >> __ffs(2)-1) >> __ffs(2)-1) & 1; unsigned int a_idx = (((idx >> __ffs(2)-1) >> __ffs(2)-1) >> __ffs(2)-1) >> __ffs(2)-1; – mark pilsur Jan 18 '21 at 15:22
I do not think that I can use x,y and z cuda threads, since you can only have (1024,1,1) as threadsperblock, 2^39 = 549.755.813.888 – mark pilsur Jan 18 '21 at 15:28
You can also different combinations as long as the product stays below 1024. But it might not be an optimization anymore, now that I know that every dimension is only two big. – paleonix Jan 18 '21 at 15:33
So, will cartesian_product_5d work if it will became cartesian_product_39d, therefore 549.755.813.888 combinations or will I be limited by the thredsperblock? – mark pilsur Jan 18 '21 at 15:39
I think it should work, but you probably should try to not safe all those indices in integer variables. You may fill up too many registers for the Kernel to properly utilize the GPU. So maybe reuse variables (having 39 local variables for the actual elements is naturally also bad, but it is less probable that you are doing that). – paleonix Jan 18 '21 at 17:37
A cartesian product of 39 length 2 lists will contain 21440476741632 numbers. Presuming they are ints, as in the sample code in the question, that is about 86000 GB of storage for the results array. – talonmies Jan 19 '21 at 00:22
@talonmies, is not 39 length, is 39 lists of 2 items for list, which gives 2^39 = 549.755.813.888 – mark pilsur Jan 19 '21 at 07:14
the cartesian product of 39 lists of length 2 is 2^39 lists. And each list has 39 entries. and 4 bytes per entries. That is 86TB of memory – talonmies Jan 19 '21 at 07:47
@talonmies, you are wrong, I said 39 lists of 2 items. E.g. a1-a39, a1[0],a1[1]........a39[0],a39[1].......2^39 = 549.755.813.888 – mark pilsur Jan 19 '21 at 08:05
I am not. Count the number of entries in the cartesian product of your 3 lists of 2 items example in your question. You get 2^3 = 8 lists . 3 entries per list. 24 numbers. i.e. the total number of entries to store is n * 2^n = 3 * 2^3 = 3 * 8 = 24, 39 lists of 2 is 39 * 2^39 = 21440476741632 numbers. – talonmies Jan 19 '21 at 08:29
Noooooo! 2^3 means 3 lists of 2 entries, that is how I have done it. Again list a1 to a39, with only index of [0] and [1] and that means 2^39 -> 2 entries to the power of 39 lists. – mark pilsur Jan 19 '21 at 08:36
I think the point were you two differ is how the elements of each combination will be operated on. I think OP wants to e.g. multiply them, so just one element comes as output from the thread. @talonmies wants to output a list of all these elements of a combination. – paleonix Jan 19 '21 at 19:00
Yes, my example (under printf) clearly shows the output and their index for 2^3. – mark pilsur Jan 20 '21 at 06:58
1

Well it shows lists of numbers. If you want to save all of these, e.g. `1 3 5` from the first thread, @talonmies is right. My interpretation was that you want to further operate on them and only save e.g. `1*3*5`, so `15` for the first thread, and so on. Which is it? – paleonix Jan 20 '21 at 11:19
No, is not 1*3*5, Please read the entire code line by line and then run it. It shows the combinations and indexes that you get if you run the code,a1[0]=1,a1[1]=2,a2[0]=3,a2[1]=4,a3[0]=5,a3[1]=6 so doing a what is called cartesian product and I do not know why is called that you get 1st combination -> 1 3 5 -> index 0 0 0, combination -> 1 3 6 -> index 0 0 1, combination 1 4 5 -> index 0 1 0 etc etc if no one is conviced run that code and it will print 2^3 = 8 combinations. – mark pilsur Jan 20 '21 at 12:30
Furthermore is not a product what I want, I want to get the combinations (which I found out is called cartesian product) between/among lists. – mark pilsur Jan 20 '21 at 12:33
And how many numbers does it print? 24.... 3 * 2^3. Sound familiar? Eight combinations of 3 elements. And that is how many numbers you need to store to have all those combinations at your disposal – talonmies Jan 20 '21 at 13:03
But that is not what I am after, I am only interested in the combinations generated, that is why maybe I misunderstood you. – mark pilsur Jan 20 '21 at 13:06
furthermore I do not store those numbers, I use them on the fly. What I needed is just a combinations of 78 numbers -> 39(lists)x2(numbers each) – mark pilsur Jan 20 '21 at 13:09
Yes talonmies, you are very right, I have just misunderstood you because I was only interested in the combinations and nothing else. Sorry about that! – mark pilsur Jan 20 '21 at 13:15
I understood your code, but printing stuff out is nothing you do in production code, therefore I assumed you would operate on them and you seem to be doing that ("I use them on the fly."). Multiplication was only an example. – paleonix Jan 20 '21 at 18:01

Generate Cartesian Product using more than two lists on GPU

1 Answers1