
I'm passing a large 2D array (in C) to the device and determining all possible combinations. For example:

A = 
id  val1 val2
1   100  200
2   400  800

Combination = 
id1  id2  sumval1  sumval2
1    2    500      1000

Because of the size of the original array, storing and returning all possible combinations is not feasible. I would like to return only the combinations where sumval1 > 500 and sumval2 > 1000.

How can I return just this subset of combinations to the host to be written to a file, given that I won't know in advance how many combinations meet the conditions?

2 Answers


Some possible approaches:

  1. Allocate (from the host) whatever space you have left in GPU memory for a buffer. If you exceed that, you weren't going to be able to pass all the combinations back in a single transfer anyway (which may lead you to the paging solution proposed by mtk99).
  2. Dynamically allocate space as you need it on the device using in-kernel malloc. When combination creation is complete, collect all of the individual combinations into a single buffer created with malloc, then pass the total size of that buffer, along with the pointer to it, back to the host. The host then allocates a new buffer of that size using cudaMalloc and launches a kernel that copies the data from the malloc'ed buffer into the cudaMalloc'ed buffer. When that copy kernel completes, the host can transfer the data back from the cudaMalloc'ed buffer.

I would suggest that 1 is probably the best approach, without knowing anything else about what you are trying to do. In-kernel malloc is not particularly fast when making large numbers of small allocations. Also, when using in-kernel malloc, note the default device heap size limit (8 MB), which can be increased.
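
A minimal host-side sketch of approach 1 (the Combination struct, the buffer names, and the fraction of free memory reserved are illustrative assumptions, not part of the original answer):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical layout for one result row.
struct Combination {
    int id1, id2;
    int sumval1, sumval2;
};

int main() {
    // Ask the runtime how much device memory is still free.
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    // Reserve most of it for results, leaving headroom for the input arrays.
    size_t maxResults = (freeBytes * 3 / 4) / sizeof(Combination);

    Combination *d_results = nullptr;
    unsigned int *d_count = nullptr;   // how many combinations were kept
    cudaMalloc(&d_results, maxResults * sizeof(Combination));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    // ... launch the combination kernel here, passing d_results, d_count, maxResults ...

    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    if (h_count > maxResults) {
        // The buffer overflowed: fall back to a paged scheme (see the other answer).
        h_count = (unsigned int)maxResults;
    }
    // Copy back only the h_count combinations that were actually written,
    // then stream them to a file on the host.
    printf("combinations kept: %u\n", h_count);

    cudaFree(d_results);
    cudaFree(d_count);
    return 0;
}
```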

Robert Crovella
  • I like the paging idea. I will allocate array R for the results, but how should I index the array? In C I would i++ every time a result is entered. Will I run into issues with all threads using the same i? – user2936659 Nov 17 '14 at 16:39
  • Yes, if you have multiple threads trying to update a single buffer, there will be issues. My answer [here](http://stackoverflow.com/questions/21786495/cuda-kernel-returning-vectors/21788662#21788662) may be useful for having multiple threads push data into a single buffer. – Robert Crovella Nov 17 '14 at 16:40
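
The pattern in that linked answer amounts to reserving one output slot per qualifying result with atomicAdd, rather than a shared i++. A rough kernel sketch (names are illustrative; the thresholds from the question are hard-coded):

```cuda
struct Combination { int id1, id2, sumval1, sumval2; };

__global__ void combine(const int *val1, const int *val2, int n,
                        Combination *results, unsigned int *count,
                        unsigned int maxResults)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n || j <= i) return;      // visit each unordered pair once

    int s1 = val1[i] + val1[j];
    int s2 = val2[i] + val2[j];
    if (s1 > 500 && s2 > 1000) {
        unsigned int idx = atomicAdd(count, 1u); // reserve a unique output slot
        if (idx < maxResults) {                  // results that don't fit are dropped
            results[idx].id1 = i + 1;            // ids are 1-based in the question
            results[idx].id2 = j + 1;
            results[idx].sumval1 = s1;
            results[idx].sumval2 = s2;
        }
    }
}
```

After the kernel finishes, the host compares the final counter against maxResults to detect overflow. The atomicAdd only serializes the slot reservation, not the writes themselves, so contention is usually tolerable when only a small fraction of pairs pass the filter.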

You can page the results:

  • Create a fixed-size result array (let's say Z items).

  • Return not only the results but the point where you stopped (last_id1, last_id2).

  • On the next call pass a new starting point (start_id1, start_id2) based on your last result.

You can use streams in order to keep the GPU loaded.

Based on this, you could even distribute the calculation using several GPUs.
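
A rough host-side sketch of this paging loop (the ScanState bookkeeping struct, the stub kernel, and all names are assumptions made for illustration, not code from the answer):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Combination { int id1, id2, sumval1, sumval2; };
struct ScanState   { int next_id1, next_id2, written, done; };

// Stub standing in for the real combination kernel: the real version would
// scan pairs starting at (state->next_id1, state->next_id2), append qualifying
// combinations to page[0..Z), and record where it stopped.
__global__ void pageKernel(Combination *page, int Z, ScanState *state) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        state->written = 0;   // stub: nothing produced
        state->done    = 1;   // stub: pretend the scan finished
    }
}

void runPaged(int Z, FILE *out) {
    Combination *d_page;  ScanState *d_state;
    cudaMalloc(&d_page, Z * sizeof(Combination));
    cudaMalloc(&d_state, sizeof(ScanState));

    ScanState h_state = {0, 1, 0, 0};  // start at the first pair
    do {
        cudaMemcpy(d_state, &h_state, sizeof(ScanState), cudaMemcpyHostToDevice);
        pageKernel<<<1, 1>>>(d_page, Z, d_state);
        cudaMemcpy(&h_state, d_state, sizeof(ScanState), cudaMemcpyDeviceToHost);

        // Copy only this page back and append it to the file.
        Combination *h_page = new Combination[h_state.written];
        cudaMemcpy(h_page, d_page, h_state.written * sizeof(Combination),
                   cudaMemcpyDeviceToHost);
        fwrite(h_page, sizeof(Combination), h_state.written, out);
        delete[] h_page;
    } while (!h_state.done);

    cudaFree(d_page);
    cudaFree(d_state);
}
```

With two page buffers and two streams, the copy of one page can be overlapped with the kernel that produces the next one, which is how the paging scheme keeps the GPU loaded.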

Juan Leni
  • I like the paging idea. I will allocate array R for the results, but how should I index the array? In C I would i++ every time a result is entered. Will I run into issues with all threads using the same i? – user2936659 Nov 17 '14 at 16:09
  • You should do stream compaction; have a look at thrust::remove_if. Generate as many combinations as you can fit in your output array. You may need the input to be a texture, so you can use the values in your predicate without copying them to the output. Run stream compaction and return last_id1 and last_id2. Call the kernel many times with different starting points as you progress through the data. – Juan Leni Nov 17 '14 at 17:51
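
A small sketch of that compaction step, assuming the kernel writes every candidate pair with its sums into the page and thrust::remove_if then drops the entries that fail the thresholds before the page is copied back (struct and function names are illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/remove.h>

struct Combination { int id1, id2, sumval1, sumval2; };

// Predicate selecting the combinations to discard.
struct Rejected {
    __host__ __device__ bool operator()(const Combination &c) const {
        return !(c.sumval1 > 500 && c.sumval2 > 1000);
    }
};

void compactPage(thrust::device_vector<Combination> &page) {
    // remove_if returns an iterator to the new logical end of the sequence;
    // erase then shrinks the vector to just the surviving combinations.
    auto newEnd = thrust::remove_if(page.begin(), page.end(), Rejected());
    page.erase(newEnd, page.end());
}
```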