I am new to CUDA and I am facing a problem with a basic projection kernel. What I am trying to do is project a 3D point cloud onto a 2D image. When multiple points project to the same pixel, only the point with the smallest depth (the closest one) should be written to the workspace.
Suppose two 3D points fall in the same image pixel, say (0, 0). The way I am implementing the depth check, `if (depth > entry.depth)`, does not work, since the two threads (from two different blocks) execute it "in parallel": both read `entry.depth` before either of them has written to it. In fact, in the `printf` statement both threads report the numeric limit (the initialization value).
To solve this problem I thought of using a tensor-like structure: each image pixel corresponds to an array of candidate values, and in a second pass each array is reduced so that only the point with the smallest depth survives (a rough sketch of this idea follows the kernel below). Are there any smarter and more efficient ways of solving this problem?
__global__ void kernel_project(CUDAWorkspace* workspace_, const CUDAMatrix* matrix_) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid >= matrix_->size())
    return;
  const Point3& full_point = matrix_->at(tid);
  float depth = 0.f;
  Point2 image_point;
  // full point as input, depth and image point as output
  const bool is_good = project(image_point, depth, full_point); // dst, dst, src
  if (!is_good)
    return;
  const int irow = (int) image_point.y();
  const int icol = (int) image_point.x();
  if (!workspace_->inside(irow, icol))
    return;
  // get reference to the entry for this pixel
  WorkspaceEntry& entry = (*workspace_)(irow, icol);
  // entry.depth is initially set to a numeric limit
  if (depth > entry.depth) // PROBLEM HERE: read and update are not atomic
    return;
  printf("entry depth %f\n", entry.depth); // BOTH THREADS PRINT THE NUMERIC LIMIT
  entry.point = full_point;
  entry.depth = depth;
}
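For clarity, here is a rough sketch of the bucket-and-reduce idea I have in mind. All the names in it (`Candidate`, `PixelBucket`, `MAX_CANDIDATES`, the two kernels) are made up for the illustration and are not part of my actual code, and I have replaced my `Point2`/`Point3` types with a point index to keep it self-contained.

#include <cfloat>

// Hypothetical fixed-capacity bucket of projected candidates per pixel.
#define MAX_CANDIDATES 8

struct Candidate {
  float depth;
  int   point_index; // index of the 3D point in the input cloud
};

struct PixelBucket {
  int       count;              // filled atomically in pass 1
  Candidate slots[MAX_CANDIDATES];
};

// Pass 1: each thread appends its candidate to the bucket of its pixel.
// buckets[].count must be zeroed (e.g. with cudaMemset) before launching.
__global__ void kernel_fill(PixelBucket* buckets, const Candidate* candidates,
                            const int* pixel_of_candidate, int num_candidates) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid >= num_candidates)
    return;
  int pixel = pixel_of_candidate[tid];
  // atomicAdd returns the previous count, which is this candidate's slot.
  int slot = atomicAdd(&buckets[pixel].count, 1);
  if (slot < MAX_CANDIDATES) // candidates beyond capacity are dropped
    buckets[pixel].slots[slot] = candidates[tid];
}

// Pass 2: one thread per pixel keeps only the closest candidate.
__global__ void kernel_reduce(const PixelBucket* buckets, Candidate* best,
                              int num_pixels) {
  int pixel = threadIdx.x + blockIdx.x * blockDim.x;
  if (pixel >= num_pixels)
    return;
  Candidate winner = {FLT_MAX, -1};
  int n = min(buckets[pixel].count, MAX_CANDIDATES);
  for (int i = 0; i < n; ++i)
    if (buckets[pixel].slots[i].depth < winner.depth)
      winner = buckets[pixel].slots[i];
  best[pixel] = winner; // point_index == -1 means the pixel is empty
}

The downsides I see are the extra memory (`MAX_CANDIDATES` slots per pixel), the second kernel launch, and the fact that overflowing candidates are silently dropped, which is why I am asking whether there is a smarter approach.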