
My task is to implement an image reconstruction algorithm in CUDA. I am provided with C code for the same. The input to the code is a DAT file which contains 360 images of size 640 x 480. The code goes something like this:

    FILE *in, *out;
    float *i_data, *o_data;
    i_data = (float *)malloc(mem_size);
    for (int projection = 0; projection < 360; projection++)
    {
        in = fopen("filename.dat", "rb");
        fread(i_data, mem_size, 1, in);
        fclose(in);   // close the file each iteration to avoid leaking handles
        ... some math ...
        for (int slice_no = -240; slice_no < 240; slice_no++)
        {
            for (i = -320; i < 320; i++)
            for (j = -320; j < 320; j++)
            {
                // do some operations
                *(o_data + slice_no + (j + 320) + (i + 240)) += *(i_data + value + (j + 240) + (i + 320));
                // some more math
            }
        }
    }

The output float pointer is written back to a DAT file. If I have to parallelize these loops, how would I do that in CUDA? I tried implementing the inner two for loops in CUDA using 640 blocks of 640 threads each. How do I give the thread index to the pointer operation inside the loop? I tried giving

         int i=blockIdx.x;
         int j=threadIdx.x;

     and 

         kernel<<<640,640>>>

But this gives wrong values in the output pointer; most are NaN. Except for the line with pointers shown in the snippet above, I was able to implement the rest of the math successfully.

Could anyone please help me with this? What is it that I am doing wrong in this code? Also, is it possible to parallelize all the for loops here?

Dhivya
  • I don't see how anyone could tell you what you are doing wrong in this code. The code you have shown, i.e. the assignment of `i` and `j`, and the (incomplete) kernel invocation are trivial, and not indicative of what you are doing. – Robert Crovella Jul 05 '13 at 19:02
  • @RobertCrovella: I think I have found what I was doing wrong. cudaMalloc was leaving random values in the float array, and that is what was creating the problem. Could you tell me how to initialize a float array after cudaMalloc? I checked this answer: http://stackoverflow.com/questions/10589925/initialize-device-array-in-cuda, but that is only for int, I guess. Could you tell me how to initialize a float array to zero? – Dhivya Jul 06 '13 at 09:04

1 Answer


It's not efficient to constrain the inner loop to run within a single block. Also, you'll get much more benefit if you parallelize all three loops (or four loops, if you can load more data from disk and send many images at once to the GPU).

Your problem looks like

for(x = minx; x < maxx; ++x)
{
  for(y = miny; y < maxy; ++y)
  {
    for(z = minz; z < maxz; ++z)
    {
      // do some math
    }
  }
}

The number of threads in your kernel should be:

num_thread = (maxx-minx) * (maxy-miny) * (maxz-minz);

Your indexing should be independent of the block size; set it to a constant like blockDim.x = 256 (experiment to find a better constant).

The number of blocks will be

gridDim.x =(num_thread + blockDim.x) / blockDim.x

Note that I'm reasoning as if the three loops were unrolled into a single big loop, but you can also make 2D and even 3D blocks of threads if your device allows it.

In your kernel, from the built-in variables, you can compute the global index inside the big unrolled loop as

int index = blockIdx.x * blockDim.x + threadIdx.x;

Make sure you're not out of range:

if(index >= num_thread) return; // do nothing

Now using index you can recover x, y and z (slice_no, i, j) as

x = index / ( (maxy-miny) * (maxz-minz) ) + minx;
y = ( index % ( (maxy-miny) * (maxz-minz) ) ) / (maxz-minz) + miny;
z = ( index % ( (maxy-miny) * (maxz-minz) ) ) % (maxz-minz) + minz;
a.lasram
  • Thank you for the reply. I will try implementing this. Could you please explain "i and j must be computed from the 1D value blockIdx.x*blockDim.x+threadIdx.x"? Sorry, I didn't get how to do that. Suppose I set 640 blocks of 640 threads each, how do I set the i and j values in the kernel? – Dhivya Jul 06 '13 at 08:59
  • @Dhivya Imagine a 2D C array "int A[H][W]". "A[i][j]" is equivalent to "((int*)A)[k]" such as "k=i*W+j". From "k" you can recover "i=k/W" and "j=k%W". The 3D "x", "y" and "z" in my answer are computed from a 1D index using the same reasoning – a.lasram Jul 07 '13 at 00:15
  • Thanks for that explanation. I implemented this, but the reconstruction is slow. Could you please tell me ways to improve the speed? I read about shared memory and texture memory, but I don't understand how to apply them to these nested for loops. – Dhivya Jul 07 '13 at 11:36
  • Sorry for the trouble. Could you also please explain how this is calculated: gridDim.x = (num_thread + blockDim.x) / blockDim.x – Dhivya Jul 07 '13 at 15:11
  • @Dhivya - You would've understood it if gridDim.x = num_thread / blockDim.x; but it's more robust to be independent of blockDim.x and let blockDim.x be any constant. However, using any constant means that num_thread is not always divisible by blockDim.x, and in that case adding blockDim.x to num_thread will add an extra block with some threads out of range. The if(index >= num_thread) return; line prevents out-of-range work. – a.lasram Jul 07 '13 at 21:40
  • @Dhivya - My answer contains code with div and mod operations executed once, which should not hurt. In your C code you're only showing a single line inside the loops. If that is the only code in your kernel, then you're doing too much work setting up buffers and sending them to the GPU; more work on the GPU side is needed to feel any speedup. Your app also does a lot of disk I/O, which masks the speedup. – a.lasram Jul 07 '13 at 21:40
  • Thank you for explaining. I tried your method, but I am getting the CUDA error 'Launch timed out and was terminated' after a few iterations. Could you please help me solve this? I am using a GT 610 graphics card and Visual Studio 2010 on Windows XP. – Dhivya Jul 08 '13 at 09:29
  • @Dhivya You're not showing enough kernel code to diagnose the issue. The kernel is allowed to execute for a limited amount of time; this amount is a parameter in the driver. If your kernel runs too long, you need to consider breaking it into smaller pieces. Increasing the timeout can be a good solution in rare cases. Try NVIDIA Parallel Nsight to profile your kernel and possibly disable the timeout. – a.lasram Jul 08 '13 at 23:19
  • Solved it. Thanks a lot for your help! :-) – Dhivya Jul 09 '13 at 08:39