
I'm writing a program in which I need to:

  • run a test on each pixel of an image
  • if the test result is TRUE, add a point to a point cloud
  • if the test result is FALSE, do nothing

I've already written working code on the CPU side in C++. Now I need to speed it up using CUDA. My idea is to have blocks of threads (one thread per pixel, I guess) execute the test in parallel and, if the test result is TRUE, have the thread add a point to the cloud.
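
Roughly, my CPU version looks something like this (simplified: the real per-pixel test is more involved, and the names here are just illustrative):

#include <vector>

struct Point { float x, y, z; };

// Placeholder test; the real criterion is more involved.
bool isInteresting(unsigned char pixel)
{
    return pixel > 128;
}

std::vector<Point> buildCloud(const unsigned char* image, int width, int height)
{
    std::vector<Point> cloud;   // grows dynamically as points are added
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (isInteresting(image[y * width + x]))
                cloud.push_back(Point{ (float)x, (float)y, 0.0f });
    return cloud;
}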

Here is my problem: how can I allocate space in device memory for a point cloud (using cudaMalloc or similar) if I don't know a priori the number of points that I will insert into the cloud?

Do I have to allocate a fixed amount of memory and then increase it every time the point cloud reaches the size limit? Or is there a method to "dynamically" allocate the memory?

dbovo89
  • I wanted to comment but it was too long, so see my "answer" for how to dynamically allocate from kernels. It is not 100% what you want, though, as it will allocate a different array each time. – Taro Apr 22 '16 at 10:26

2 Answers


I would have liked to post this as a comment, since it only partially answers the question, but it is too long for that.

Yes, you can dynamically allocate memory from kernels. You can call malloc() and free() within your kernels to dynamically allocate and free memory during computation, as explained in section B.16 of the CUDA 7.5 Programming Guide:

#include <cstdio>
#include <cstdlib>

// Each of the 5 threads allocates, clears, and frees its own buffer
// from the device heap.
__global__ void mallocTest()
{
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}

(You will need compute capability 2.x or higher.)

But by doing this you allocate a new, separate buffer in memory; you don't make your previously host-allocated buffer "grow" like a CPU dynamic container (vector, list, etc.).

I think you should define a constant for the maximum size of your array, allocate that maximum size up front, and have your kernel increment the "actually used size" within that buffer. If you do so, don't forget to make the increment atomic/synchronized so that every concurrent thread's contribution is counted.
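
For example, here is a minimal sketch of that workaround (the Point struct and the isInteresting() test are only placeholders for your real types and criterion):

struct Point { float x, y, z; };

// Placeholder test; replace with the real per-pixel criterion.
__device__ bool isInteresting(unsigned char pixel)
{
    return pixel > 128;
}

__global__ void buildCloud(const unsigned char* image, int width, int height,
                           Point* cloud, int* usedSize, int maxPoints)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    if (isInteresting(image[y * width + x])) {
        // Atomically reserve one slot in the pre-allocated buffer.
        int idx = atomicAdd(usedSize, 1);
        if (idx < maxPoints) {
            Point p;
            p.x = (float)x;
            p.y = (float)y;
            p.z = 0.0f;
            cloud[idx] = p;
        }
    }
}

On the host you would cudaMalloc maxPoints * sizeof(Point) for cloud, zero usedSize before the launch, and copy usedSize back afterwards to know how many entries are valid (if it comes back larger than maxPoints, the buffer was too small).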

Taro

When you allocate memory on the device, you may do so with two API calls: one is malloc as described by Taro, but it is limited by an internal driver limit (8 MB by default), which can be increased by calling cudaDeviceSetLimit with the parameter cudaLimitMallocHeapSize.

Alternatively, you may use cudaMalloc within a kernel, as it is both a host and device API method (device-side use requires dynamic parallelism support).

In both cases, Taro's observation stands: you allocate a new, separate buffer, just as you would on the CPU. Hence, if you want to end up with a single buffer, you may need to copy the data. Note that cudaMemcpy is not a device API method, so you may need to write your own device-side copy.
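
Such a copy can be a trivial kernel you write yourself; a minimal sketch, reusing the placeholder Point struct from the other answer:

// Grid-stride copy: each thread copies a strided subset of the points.
__global__ void copyPoints(Point* dst, const Point* src, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        dst[i] = src[i];
}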

To my knowledge, there is no such thing as realloc in the CUDA API.

Back to your original issue, you might want to implement your algorithm in three phases: the first phase counts the number of samples you need, the second allocates the data array, and the third fills it. To implement the counting, you may use atomic functions to increment an int that holds the number of samples.
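
A minimal sketch of phases 1 and 3 (phase 2 is just a host-side cudaMalloc of count * sizeof(Point) once the count has been copied back; Point and isInteresting() are the same placeholders as in the other answer):

// Phase 1: count the pixels that pass the test.
__global__ void countPoints(const unsigned char* image, int width, int height,
                            int* count)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height && isInteresting(image[y * width + x]))
        atomicAdd(count, 1);
}

// Phase 3: fill the exactly-sized array allocated on the host after phase 1.
__global__ void fillPoints(const unsigned char* image, int width, int height,
                           Point* cloud, int* writeIndex)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    if (isInteresting(image[y * width + x])) {
        int idx = atomicAdd(writeIndex, 1);   // guaranteed to fit this time
        Point p;
        p.x = (float)x;
        p.y = (float)y;
        p.z = 0.0f;
        cloud[idx] = p;
    }
}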

Florent DUGUET
  • The idea of implementing in three phases seems interesting, and will avoid allocating a buffer bigger than needed (like with the workaround I suggested). I think @dbovo89 should give this a try :) – Taro Apr 22 '16 at 12:01