
Solved: it was my fault; I should have used atomicAdd(times, 1); instead of (*times)++; in the kernel function.

I call the kernel like this:

dim3 Dg(blockSize, blockSize, blockSize);
dim3 Db(8, 8, 8);
voxelize<<<Dg, Db>>>();
cudaDeviceSynchronize();

But I found that my program solved only part of the problem, so I added a printf() to my __global__ function voxelize(), as in the following code:

__global__ void voxelize(){
    printf("the thread blockIdx.x %d, blockIdx.y %d blockIdx.z %d\n", blockIdx.x, blockIdx.y, blockIdx.z);
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int zIndex = blockDim.z * blockIdx.z + threadIdx.z;
    unsigned int i = zIndex * blockDim.x * blockDim.y + yIndex * blockDim.x + xIndex;
}    

The output showed that only the last part of each dimension ran (that is, blockIdx.x was always 5, and only some of the blockIdx.z values ranged from 0 to 5). But I don't understand why. Is there anything wrong with how I call this kernel? My machine has a GTX 1050 Ti Max-Q and CUDA 10.


Afterwards, I passed a pointer to the kernel to count how many times it actually ran.

 int blockSize = ceil(pow(triangles.size() / 512.0, 1.0 / 3)); // triangles.size() is 69664
 dim3 Dg(blockSize, blockSize, blockSize);
 dim3 Db(8, 8, 8);
 int* times = new int(0);
 int* gpu_times;
 cudaMalloc((void **)&gpu_times, sizeof(int));
 cudaMemcpy(gpu_times, times, sizeof(int), cudaMemcpyHostToDevice);
 voxelize<<<Dg, Db>>>(gpu_times);
 cudaDeviceSynchronize();
 cudaMemcpy(times, gpu_times, sizeof(int), cudaMemcpyDeviceToHost);
 std::cout << *times << std::endl;
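
(Sanity-checking the launch configuration, assuming triangles.size() is 69664 as stated: 69664 / 512.0 = 136.0625, whose cube root is about 5.14, so blockSize = ceil(5.14) = 6. The grid is therefore 6 x 6 x 6 = 216 blocks of 8 x 8 x 8 = 512 threads, i.e. 110592 threads in total, and blockIdx.x/y/z each run from 0 to 5, which matches the largest indices seen in the printf output.)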

the kernel is modified as

__global__ void voxelize(int* times){
    (*times)++; // data race: every thread does an unsynchronized read-modify-write here
    printf("the thread blockIdx.x %d, blockIdx.y %d blockIdx.z %d\n", blockIdx.x, blockIdx.y, blockIdx.z);
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int zIndex = blockDim.z * blockIdx.z + threadIdx.z;
    unsigned int i = zIndex * blockDim.x * blockDim.y + yIndex * blockDim.x + xIndex;
}    

The output (a screenshot in the original post) shows the counter reached only 141, but it should have been far more than 69664.


Sorry, it was my fault; I should have used atomicAdd(times, 1); instead of (*times)++;.

But why does printf() only output a part of the indices, as I described above?

Forsworn

1 Answer


For your printf problem

You need to call cudaDeviceSynchronize() (error checking omitted for clarity), and you also need cudaDeviceSetLimit(...) if you use a lot of printf (which is the case here). Device-side printf writes into a fixed-size circular buffer that is only flushed to the host at synchronization points; if more output is produced than fits in the buffer, older output is overwritten, which is why you only saw the last part of the indices:

#include <stdio.h>

__global__ void voxelize(){
    printf("the thread blockIdx.x %d, blockIdx.y %d blockIdx.z %d\n", blockIdx.x, blockIdx.y, blockIdx.z);
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int zIndex = blockDim.z * blockIdx.z + threadIdx.z;
    unsigned int i = zIndex * blockDim.x * blockDim.y + yIndex * blockDim.x + xIndex;
}

int main()
{
  // Increase the device printf buffer to 50 MiB (must be set before the kernel launch)
  cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 50*1024*1024);
  dim3 Dg(5, 5, 5);
  dim3 Db(8, 8, 8);
  voxelize<<<Dg, Db>>>();
  cudaDeviceSynchronize();

  return 0;
}

This will print something like:

the thread blockIdx.x 2, blockIdx.y 3 blockIdx.z 4
the thread blockIdx.x 2, blockIdx.y 3 blockIdx.z 4
the thread blockIdx.x 2, blockIdx.y 3 blockIdx.z 4
the thread blockIdx.x 2, blockIdx.y 3 blockIdx.z 4
the thread blockIdx.x 2, blockIdx.y 3 blockIdx.z 4
[...]

You can then check it like this:

# This will keep one line per block and count them, so 5*5*5 == 125
$ ./a.out | sort | uniq | wc -l
125

# This will output one line per thread and count them, so 5*5*5 * 8*8*8 == 64000
$ ./a.out | wc -l
64000
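
As an aside, the code above omits error checking; a minimal sketch of the usual pattern (the CUDA_CHECK macro is a hypothetical helper, not part of the original code) is:

#include <cstdio>

// Hypothetical helper: report any CUDA API error with file/line context
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess)                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
    } while (0)

voxelize<<<Dg, Db>>>();
CUDA_CHECK(cudaGetLastError());       // catches bad launch configurations
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution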

For your count problem

You can't do (*times)++; there: every thread performs an unsynchronized read-modify-write on the same address, so increments get lost. You'll have concurrency problems; you need to use atomic functions.
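
A minimal sketch of the fix, using the atomicAdd(times, 1) you mentioned (only the counting part of the kernel is shown):

__global__ void voxelize(int* times){
    // atomicAdd performs the read-modify-write as one indivisible operation,
    // so no increments are lost between competing threads
    atomicAdd(times, 1);
}

With the 6 x 6 x 6 grid of 8 x 8 x 8 blocks above, the counter should then end at 216 * 512 = 110592 rather than a small, run-dependent value like 141.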

Robin Thoni
  • I did call it; in my real code, the output I want is `the thread blockIdx.x 0, blockIdx.y 3 blockIdx.z 4` `the thread blockIdx.x 1, blockIdx.y 3 blockIdx.z 4` `the thread blockIdx.x 2, blockIdx.y 3 blockIdx.z 4` `the thread blockIdx.x 3, blockIdx.y 3 blockIdx.z 4` `the thread blockIdx.x 4, blockIdx.y 3 blockIdx.z 4` – Forsworn Nov 09 '18 at 11:34
  • Maybe I misunderstand CUDA threads? I think the kernel function will run #threads times, and the index of each thread can be calculated from blockIdx, blockDim, threadIdx, etc. – Forsworn Nov 09 '18 at 11:39
  • You can refer to this gist to compute the thread index: https://gist.github.com/waltner/6888263df7bceaad9ffc8c1408f68e3c – Robin Thoni Nov 09 '18 at 11:45
  • Thank you, Robin. Now I know it's my homework algorithm rather than my machine that is wrong... :( But I'm still wondering why it only outputs a part of the indices, as I described above? – Forsworn Nov 09 '18 at 11:54
  • I don't really know; have you checked for CUDA errors? – Robin Thoni Nov 09 '18 at 13:31
  • I checked the CUDA errors for memory management; maybe I should read the docs more. I thought it would be easy to just migrate the voxelization code to the GPU without worrying much about efficiency, but I was wrong. The result was a fragmented voxelized bunny model, which is why I (wrongly) thought some threads weren't running. Anyway, thank you very much. – Forsworn Nov 09 '18 at 14:00
  • I was writing some CUDA code on a post card (don't ask...) when I understood your problem: you need to use `cudaDeviceSetLimit(...)`. I updated the answer with it. – Robin Thoni Nov 17 '18 at 11:52