I'm writing a program in which I need to:
- run a test on each pixel of an image
- if the test result is TRUE, add a point to a point cloud
- if the test result is FALSE, do nothing
I've already written working C++ code for the CPU side. Now I need to speed it up using CUDA. My idea is to have blocks of threads (one thread per pixel, I guess) run the test in parallel and, whenever the test result is TRUE, have that thread add a point to the cloud.
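For reference, the CPU version has roughly the following shape. This is a simplified sketch rather than the actual code: the `Point` struct, the `passesTest` threshold check, and the row-major 8-bit grayscale image layout are all placeholders chosen for illustration.

```cpp
#include <cstdint>
#include <vector>

// Placeholder point type (assumption: the real cloud may store more fields).
struct Point {
    float x, y, z;
};

// Placeholder per-pixel test (assumption: the real test is more involved).
bool passesTest(std::uint8_t pixel) {
    return pixel > 128;
}

// Scan a row-major grayscale image and collect one point per accepted pixel.
std::vector<Point> buildCloud(const std::vector<std::uint8_t>& image,
                              int width, int height) {
    std::vector<Point> cloud;  // grows on demand; final size is not known up front
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::uint8_t pixel = image[y * width + x];
            if (passesTest(pixel)) {
                cloud.push_back({static_cast<float>(x),
                                 static_cast<float>(y),
                                 static_cast<float>(pixel)});
            }
            // Test failed: do nothing.
        }
    }
    return cloud;
}
```

On the CPU, `std::vector` hides the growth of the cloud, which is exactly the part that has no obvious equivalent inside a kernel.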
Here is my problem: how can I allocate space in device memory for the point cloud (using cudaMalloc or similar) if I don't know in advance how many points will be inserted into it?
Do I have to allocate a fixed amount of memory and then enlarge it every time the point cloud reaches its capacity? Or is there a way to allocate the memory "dynamically"?
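To make the fixed-size option concrete, here is a host-side C++ sketch of what I have in mind, not device code. It assumes the buffer is sized once for the worst case (every pixel passes, so at most width * height points) and that each accepted pixel reserves a slot through an atomic counter. `CloudPoint`, `FixedCloud`, `acceptPixel`, and `fillCloud` are hypothetical names, and `std::atomic::fetch_add` stands in for a device-side `atomicAdd`, which likewise returns the previous value.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical point type for this sketch.
struct CloudPoint {
    float x, y, z;
};

// Buffer allocated once for the worst case, plus a counter of used slots.
struct FixedCloud {
    std::vector<CloudPoint> points;
    std::atomic<std::size_t> count{0};

    explicit FixedCloud(std::size_t capacity) : points(capacity) {}

    // Reserve the next free slot and write the point into it.
    void append(const CloudPoint& p) {
        std::size_t slot = count.fetch_add(1);  // returns the old value
        points[slot] = p;
    }
};

// Placeholder per-pixel test (assumption, not the real test).
bool acceptPixel(std::uint8_t pixel) {
    return pixel > 128;
}

// Each loop iteration does the work one GPU thread would do for its pixel.
// Returns the number of points actually written.
std::size_t fillCloud(const std::vector<std::uint8_t>& image,
                      int width, int height, FixedCloud& cloud) {
    for (int idx = 0; idx < width * height; ++idx) {
        if (acceptPixel(image[idx])) {
            cloud.append({static_cast<float>(idx % width),
                          static_cast<float>(idx / width),
                          static_cast<float>(image[idx])});
        }
    }
    return cloud.count.load();
}
```

With this layout the buffer never has to grow, at the cost of reserving memory for every pixel, and with real parallel threads the points would presumably end up in arbitrary order.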