My task is to implement an image reconstruction algorithm in CUDA. I have been given a C implementation of the same algorithm. The input is a DAT file containing 360 images of size 640 x 480. The code looks something like this:
FILE *in,*out;
float *i_data,*o_data;
i_data=(float *)malloc(mem_size);
for(int projection=0;projection<360;projection++)
{
    in=fopen("filename.dat","rb");
    fread(i_data,mem_size,1,in);
    ... some math ...
    for(int slice_no=-240;slice_no<240;slice_no++)
    {
        for (i=-320;i<320;i++)
            for (j=-320;j<320;j++)
            {
                // do some operations
                (*(o_data*slice_no)+(j+320)+(i+240))+=(*(i_data*value)+(j+240)+(i+320));
                // some more math
            }
    }
}
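As written, the accumulation line multiplies a pointer by an integer, which is not valid C, so the intended addressing has to be guessed. A minimal sketch of one plausible reading, assuming the output is a volume of 480 slices of 640 x 640 floats stored row-major in a single flat buffer. The helper name `vol_index` and the constants `NX`, `NY` and `NSLICES` are hypothetical and do not come from the original code:

```c
#include <stddef.h>

/* Assumed output volume layout: NSLICES slices, each NX rows by NY columns. */
enum { NX = 640, NY = 640, NSLICES = 480 };

/* Hypothetical helper: flat offset of element (i, j) in slice slice_no.
   The loop variables are centred on zero (i, j in -320..319 and
   slice_no in -240..239), so each one is shifted to a 0-based index first. */
static size_t vol_index(int slice_no, int i, int j)
{
    size_t s   = (size_t)(slice_no + NSLICES / 2);
    size_t row = (size_t)(i + NX / 2);
    size_t col = (size_t)(j + NY / 2);
    return (s * NX + row) * NY + col;
}
```

Under that assumption the accumulation would read `o_data[vol_index(slice_no, i, j)] += ...;`, with the right-hand side indexed into `i_data` in the same flat style.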
The output buffer is then written back to a DAT file. How would I parallelize these loops in CUDA? I tried implementing the two inner for loops as a kernel launched with 640 blocks of 640 threads each. How do I use the block and thread indices in the pointer arithmetic inside the loop? I tried
int i=blockIdx.x;
int j=threadIdx.x;
and
kernel<<<640,640>>>
But this gives wrong values in the output buffer; most of them are NaN. Apart from the line with the pointer arithmetic shown in the snippet above, I was able to implement the rest of the math successfully.
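To make the index mapping concrete, here is a minimal CPU sketch in plain C of what a `<<<640,640>>>` launch would compute for one slice. The names `slice_body`, `launch_emulated` and `contrib` are hypothetical; `contrib` stands in for whatever value the elided math produces for each element. In a real kernel, `slice_body` would be a `__global__` function and `block` and `thread` would come from `blockIdx.x` and `threadIdx.x`.

```c
#include <stddef.h>

enum { N = 640 };  /* assumed slice size: N x N floats, row-major */

/* Body of a hypothetical kernel for a single slice. */
static void slice_body(float *o_slice, const float *contrib,
                       int block, int thread)
{
    /* Recover the centred loop variables of the serial code,
       in case the elided math needs them. */
    int i = block  - N / 2;   /* -320 .. 319 */
    int j = thread - N / 2;   /* -320 .. 319 */

    /* Shift back to 0-based coordinates for addressing. */
    size_t idx = (size_t)(i + N / 2) * N + (size_t)(j + N / 2);

    /* Each (block, thread) pair touches exactly one element,
       so no two threads write to the same location. */
    o_slice[idx] += contrib[idx];
}

/* CPU stand-in for kernel<<<640,640>>>(o_slice, contrib). */
static void launch_emulated(float *o_slice, const float *contrib)
{
    for (int b = 0; b < N; b++)
        for (int t = 0; t < N; t++)
            slice_body(o_slice, contrib, b, t);
}
```

Because the update is a `+=` repeated over all projections, the output buffer has to start at zero. One possible cause of NaN values is accumulating into device memory that was never cleared (for example with `cudaMemset`); another is an index that runs outside the allocated buffer.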
Could anyone please help me with this? What am I doing wrong in this code? Also, is it possible to parallelize all of the for loops here?