cuda convolution mapping

Question

i'm trying to copy for each block of threads a patch of image and relative apron to shared memory.

After my data are copyied(i used a matrix) to shared memory, i want a relations that map the center of the mask in shared memory that i consider for convolution and the center of the mask in the image buffer.

I want that because if i try to do convolution of image seems that the center of the mask in shared memory doesn't correspond to the center in the image buffer stored in global memory.

In the code below i write an example of simple image black and white erosion algorithm , when i put the result of a convolution to the output image seems that the center not corresponds.

i write my sample using a 512x512px image

my settings are:

//block and grid size
dim3 block(16,16);
dim3 grid(512/(block.x),512/(block.y),1);

this is my kernel:

#define STREL_SIZE 5

#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)

//size of tile image + apron
#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))


 __global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

    // Data cache: threadIdx.x , threadIdx.y
    __shared__ unsigned char data[TILE_W +STREL_SIZE  ][TILE_H +STREL_SIZE ];


     int col = blockIdx.x * blockDim.x + threadIdx.x;
     int row = blockIdx.y * blockDim.y + threadIdx.y;

     // global mem address of this thread
     int gLoc =  row*w +col;


     int x, y;  // image based coordinate



     if((col<w)&&(row<h)) {
         data[threadIdx.x][threadIdx.y]=buffer_in[gLoc];

     if (threadIdx.y > (h-STREL_SIZE))
          data[threadIdx.x][threadIdx.y + STREL_SIZE]=buffer_in[gLoc + STREL_SIZE];

     if (threadIdx.x >(w-STREL_SIZE))
          data[threadIdx.x + STREL_SIZE][threadIdx.y]=buffer_in[gLoc+STREL_SIZE];

     if ((threadIdx.x >(w-STREL_SIZE)) && (threadIdx.y > (h-STREL_SIZE)))
          data[threadIdx.x+STREL_SIZE][threadIdx.y+STREL_SIZE] =     buffer_in[gLoc+2*STREL_SIZE];

     //wait for all threads to finish read
     __syncthreads();

      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }
}

my input image:

my the output of the kernel is:

where w is the width of image,and is equal 512, where h is the height of image,and is equal 512.

i call the kernel with:

 erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);

the dimage_src is the input image an array buffer not a matrix, and dimage_dst is the output image a buffer.

each buffer have the size of nElem * nImg * sizeof(unsigned char) where nElem=512*512 is the size of the buffer and nImg is the number of image that i want processing in my case is equal to 1. where i'm wrong?

CODE UPDATE:

__global__ void erode_multiple_img_SM_v2(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){

// Data cache: threadIdx.x , threadIdx.y
__shared__ unsigned char data[TILE_W + STREL_SIZE-1 ][TILE_H + STREL_SIZE-1 ];

// global mem address of this thread
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;



int gLoc =  row*w +col;



// each threads loads four values from global memory into shared mem
int x, y;   // image based coordinate



if((col<w)&&(row<h)) {

    data[threadIdx.x][threadIdx.y] = buffer_in[gLoc];

     if (threadIdx.y > (TILE_H-STREL_SIZE+1))
          data[threadIdx.x][threadIdx.y + STREL_SIZE-1]=buffer_in[(row + STREL_SIZE-1)*w + col];

     if (threadIdx.x > (TILE_W-STREL_SIZE+1))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = buffer_in[row*w+col + STREL_SIZE-1];

     if ((threadIdx.x > (TILE_W-STREL_SIZE+1)) && (threadIdx.y > (TILE_H-STREL_SIZE+1)))
           data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1];

    //wait for all threads to finish read
     __syncthreads();



      unsigned char min_value = 255;
      for(x=0;x<STREL_SIZE;x++){
          for(y=0;y<STREL_SIZE;y++){
              min_value = min( (data[threadIdx.x+x][threadIdx.y+y]) , min_value);
              }

          }
      buffer_out[gLoc]= min_value;
      }

    }

my output now is:

UPDATED 2(version 2 -working-):

i have implemented another version of algorithm.To do that i follow that slide that i found very useful and well explained,in particular the part in wich the author talk about convolution slide 27.

i change the block and grid settings to :

dim3 block(20,20);
dim3 grid(512/(block.x)+ block.x,512/(block.y)+block.y);

the kernel call instead ramain the same:

erode_multiple_img_SM<<<grid,block>>>(dimage_src,dimage_dst,512,512);

where the argument of the kernel are:

dimage_src: buffer of unsigned char with size height x width that contain input image.
dimage_dst:**buffer of unsigned char with size **height x width that contain output image, that my kernel produced.
512: the third argument is the width of the image.
512: the fourth argument is the height of the image.

remember my image sample are black and white but this version of erosion can work with grayscale too.

here my working Kernel:

#define STREL_W 5
#define STREL_H 5

#define STREL_SIZE 5


#define TILE_W 16
#define TILE_H 16

#define R (STREL_SIZE/2)


#define BLOCK_W (TILE_W+(2*R))
#define BLOCK_H (TILE_H+(2*R))

__global__ void erode_multiple_img_working(unsigned char * buffer_in,
                            unsigned char * buffer_out,
                            int w,int h ){


    __shared__ unsigned char fast_acc_mat[BLOCK_w][BLOCK_H];

    int ty = threadIdx.y;
    int tx = threadIdx.x;


    int row_o = blockIdx.y * TILE_W + ty;
    int col_o = blockIdx.x * TILE_H + tx;


    int row_i = row_o - R;
    int col_i = col_o - R;

    //in of img size
    if((row_i >= 0) && (row_i < h) && (col_i >= 0) && (col_i < w) ){

        fast_acc_mat[ty][tx] = buffer_in[ row_i * w + col_i];

    }
    else{

        fast_acc_mat[ty][tx] = 0;

    }


    __syncthreads();





    if( ty < TILE_H && tx < TILE_W ){

        unsigned char min_val=255;
        for(int i = 0; i < STREL_SIZE; i++) {
            for(int j = 0; j < STREL_SIZE; j++) {

                min_val = min( fast_acc_mat[i+ty][j+tx] , min_val );

            }
        }
        if(row_o < h && col_o < w)
                buffer_out[row_o * w + col_o] = min_val;

        }

     }

and this is my eroded image(output):

I realized a scheme that show how the part of the algorithm described by Eric load pixel of a TILE in shared memory :

i can't put all the code inside the post becaus it too big. but i also put the kernel call if you think that help to read the problem? — userfi, Jun 29 '16 at 12:20
Maybe you should also be more precise on what you want to accomplish. You code is not really self-explaining. — havogt, Jun 29 '16 at 12:35
The code belongs in the question, not in an external link. SO is [clear about this](http://stackoverflow.com/help/on-topic) ("**shortest** code... **in the question itself**). If the code is too long for the question, then you have not created a proper [mcve]. — Robert Crovella, Jun 29 '16 at 13:26
sorry, i think that the question is well exposed. i removed the link — userfi, Jun 29 '16 at 13:32
you still miss two "}" at the end. do you really want to make your code work? — kangshiyin, Jun 29 '16 at 17:04

score 2 · Accepted Answer · edited Jul 01 '16 at 13:19

2

You need only [20][20] shared mem, rather than [21][21]. It should be changed to

__shared__ unsigned char data[TILE_W + STREL_SIZE-1][TILE_H + STREL_SIZE-1];

Another problem is the data loading. The correct way is to read (16+4) x (16+4) pixels from input to share memory, using (16 x 16) threads collaboratively. This can be divided into 4 parts:

1)first part: thread(0:15, 0:15) load pixels (0:15,0:15)

2)second part: thread(0:15,12:15) load pixels (0:15, 16:19)

3)third part: thread(12:15,0:15) load pixels (16:19,0:15)

4)fourth part: thread(12:15,12:15) load pixels (16:19,16:19)

But in your code you are messing up the indexing. For part 2~4, only some of the threads in the thread block will be working, and additional boundary checking is also required.

For the 2nd part, you should use thread(0:15, 12:15) to read pixel(0:15, 16:19) as

 if (threadIdx.y > (TILE_H-STREL_SIZE))
      data[threadIdx.x][threadIdx.y + STREL_SIZE-1] = row + STREL_SIZE-1<h ? buffer_in[(row + STREL_SIZE-1)*w + col] : 0;

The 3rd and the 4th part require similar modifications as

 if (threadIdx.x > (TILE_W-STREL_SIZE))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y] = col + STREL_SIZE-1<w ? buffer_in[row*w+col + STREL_SIZE-1] : 0;

 if ((threadIdx.x > (TILE_W-STREL_SIZE)) && (threadIdx.y > (TILE_H-STREL_SIZE)))
      data[threadIdx.x + STREL_SIZE-1][threadIdx.y + STREL_SIZE-1] = (row + STREL_SIZE-1<h && col + STREL_SIZE-1<w) ? buffer_in[(row + STREL_SIZE-1)*w + col + STREL_SIZE-1] : 0;

Then you should be able to get the correct result image, although there will be 2x2 pixel shift, because you do the convolution on (0...4, 0...4) rather than (-2. .2, -2...2).

For more details, you could read

http://igm.univ-mlv.fr/~biri/Enseignement/MII2/Donnees/convolutionSeparable.pdf

https://www.evl.uic.edu/sjames/cs525/final.html

edited Jul 01 '16 at 13:19

userfi

175
1
1
15

answered Jun 29 '16 at 16:43

kangshiyin

9,681
1
17
29

1

thank's for your answer, now i'm so confused. i'm not understand how you map tile image coordinate with the shared memory. can you explain me it please?. Also for the first part i must change the index of **buffer_in[gLocl]**? then i need change also the block dimension to **20x20**? – userfi Jun 29 '16 at 18:04
@userfi I think my explanation is clear enough. I cannot do it better otherwise it will be too long for an answer. I suggest you read the links I gave you the other day for more details, They are perfect teaching docs. – kangshiyin Jun 29 '16 at 18:21
@Eric also i try to modify my code as you describe but the result is similar – userfi Jun 29 '16 at 18:23
@userfi No, first part is fine. No, you don't need to change block dim. You may want to show a full updated code and the new result. I saw you posted same code in two questions but showing two different results. Which one I should follow? – kangshiyin Jun 29 '16 at 18:38
@userfi you have modifications not from my answer. That's the problem. Maybe you want to focus on the the single image first. – kangshiyin Jun 29 '16 at 19:17
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116010/discussion-between-userfi-and-eric). – userfi Jun 29 '16 at 19:21
@userfi I found some bug in my code. You could find what is changed in my 8th edit. I've tested it to be correct with a 32x32 input image. However this bug is probably the reason that causes the pattern in your output image. May be you can write a code with `main()` and `#include` to demo how you call your kernel. – kangshiyin Jun 30 '16 at 11:53
@Eric your code work! greate. i've another question relative my problem can i show you it? – userfi Jun 30 '16 at 12:37
@userfi you could just publish on SO. – kangshiyin Jun 30 '16 at 12:45
cause is another version of the algorithm can i append it below the question ? – userfi Jun 30 '16 at 12:49
better not change the question any more. if you want multi-image, just add the term `(plane*h*w)` to each index of `buffer_in[ ... ]` like `buffer_in[ (plane*h*w) + ... ]` – kangshiyin Jun 30 '16 at 12:58
@userfi it is working, but may not be able to reach the peak performance as only part of the threads are working in the loop. Grid size is wrong. – kangshiyin Jun 30 '16 at 17:14
@Eric you're right , i read it in your slide. How i set grido size? Only **grid(512/block.x,512/block.y)** ? – userfi Jun 30 '16 at 17:39
grid(512/16,512/16); – kangshiyin Jun 30 '16 at 17:43
1

@Eric maybe i understand, grido size must be: **grid(512/TILE_H,512/TILE_W)** correct? – userfi Jun 30 '16 at 18:03
@Eric i realized a scheme that show how your part of code load pixel of a tile in shared memory, can you see it and tell me if is correct? i add it in the post. – userfi Jul 01 '16 at 10:07
@userfi it is close. there's no overlapping between 4 parts. – kangshiyin Jul 01 '16 at 10:12
@kangshiyin i add it to explain well the eric method, i adjust my scheme immediately. – userfi Jul 01 '16 at 10:22
1

the black thin line does not exist. the first part is TILE_H x TILE_W – kangshiyin Jul 01 '16 at 10:32
@kangshiyin thus some threads load the same pixel many times? – userfi Jul 01 '16 at 11:30
@kangshiyin no the area is TILE_W + 2*R x TILE_H + 2*R in my figure that is equal to 20 x 20 – userfi Jul 01 '16 at 11:38
ok I was wrong about the area. The height of the 2rd part should be 2*R, as in my answer - from dim 16 to dim 19, which is 4. the same as the width of the 3rd part. – kangshiyin Jul 01 '16 at 11:41
@kangshiyin ok, but i can't understand that part: **row + STREL_SIZE-1 < h ? buffer_in[plane_offset + (row + STREL_SIZE-1)*w + col] : 255;** why i need to load the value from **row + STREL_SIZE-1** to **h** if i have already load them in the first part? – userfi Jul 01 '16 at 11:49
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116190/discussion-between-kangshiyin-and-userfi). – kangshiyin Jul 01 '16 at 11:52

cuda convolution mapping

1 Answers1

Linked