finding local mean in an image using mex-cuda

Question

I have an image named HSIImage, of size is 565x585, in which I have find the local mean and standard deviation at every pixel. For this I am using a window W of size 9x9, if we a re finding the mean of x(i,j) we need values in the W where x(i,j) is at its center.

For working on the corner and edge pixels, I am padding the HSIImage and naming it as HSIImage2.

MATLAB code

[m,n,~]  =  size(HSIImage);
HSIImage2=padarray(HSIImage,[4,4],'symmetric');

mean1    = zeros(m,n);
sd       = zeros(m,n);
phi_x=zeros(m,n);

for i=5:m+4
    for j=5:n+4
        mean1(i-4,j-4) = mean( mean(HSIImage2(i-4:i+4, j-4:j+4, 3) )); %sum / (4*4);
        sd(i-4,j-4) = std( std(HSIImage2(i-4:i+4, j-4:j+4, 3), 1)); 
    end
end
[phi_x2,mean2,sd2] = getPhi(HSIImage(:,:,3)',HSIImage2(:,:,3)',m,n);

Serial mean displayed as image. My cuda code for finding mean and sd is

__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;

if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
    for(i=Y-4;i<=Y+4;i++){
        for(j=X-4;j<=X+4;j++){
            sum= sum + d_HSIImage2[i*col+j];
        }
    }

    d_mean[(Y-4)*col+X-4] = sum/81;
    double mean = sum/81;
    sum = 0;

    for(i=Y-4;i<=Y+4;i++){
        for(j=X-4;j<=X+4;j++){
            int index = i*col+j;
            sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
        }
    }
    d_std[(Y-4)*col+X-4] = sqrt(sum/81);
}
void mexFunction( int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[])
{
double*     HSIImage;
double*     d_HSIImage;
double*     HSIImage2;
double*     d_HSIImage2;
double      row;
double      col;
double*     phi_x;
double*     d_phi_x;
double*     mean2;
double*     d_mean;
double*     d_std;
double*     sd2;

HSIImage    = (double*)mxGetPr(prhs[0]);
HSIImage2   = (double*)mxGetPr(prhs[1]);
row         = mxGetScalar(prhs[2]);
col         = mxGetScalar(prhs[3]);

plhs[0]     = mxCreateDoubleMatrix(row,col,mxREAL);
phi_x       = mxGetPr(plhs[0]);
plhs[1]     = mxCreateDoubleMatrix(row,col,mxREAL);
mean2       = mxGetPr(plhs[1]);
plhs[2]     = mxCreateDoubleMatrix(row,col,mxREAL);
sd2         = mxGetPr(plhs[2]);

dim3 grid(((col+8)/TILE_WIDTH)+1,((row+8)/TILE_WIDTH)+1,1);
dim3 block(TILE_WIDTH,TILE_WIDTH,1);

if ( cudaMalloc(&d_HSIImage,sizeof(double)*row*col)!= cudaSuccess )  
    mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_HSIImage2,sizeof(double)*(row+8)*(col+8))!= cudaSuccess )  
    mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_phi_x,sizeof(double)*row*col)!= cudaSuccess )  
    mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_mean,sizeof(double)*row*col)!= cudaSuccess )  
    mexErrMsgTxt("Memory allocating failure on the GPU");
if ( cudaMalloc(&d_std,sizeof(double)*row*col)!= cudaSuccess )  
    mexErrMsgTxt("Memory allocating failure on the GPU");

cudaMemcpy(d_HSIImage,HSIImage,sizeof(double)*row*col,cudaMemcpyHostToDevice);
cudaMemcpy(d_HSIImage2,HSIImage2,sizeof(double)*(row+8)*(col+8),cudaMemcpyHostToDevice);

phi <<< grid,block >>> (d_HSIImage,d_HSIImage2,row,col,d_phi_x,d_mean,d_std);

cudaMemcpy(phi_x,d_phi_x,sizeof(double)*row*col,cudaMemcpyDeviceToHost);
cudaMemcpy(mean2,d_mean,sizeof(double)*row*col,cudaMemcpyDeviceToHost);
cudaMemcpy(sd2,d_std,sizeof(double)*row*col,cudaMemcpyDeviceToHost);

cudaFree(d_HSIImage);
cudaFree(d_HSIImage2);
cudaFree(d_phi_x);

}

its working fine when image is full of ones. but when I give regular image, there is lot of difference in serial(MATLAB) and parallel(CUDA) outputs(When mean1 and mean2 are compared). Please tell me the error.

I am launching with

dim3 grid(((col+8)/TILE_WIDTH)+1,((row+8)/TILE_WIDTH)+1,1);
dim3 block(TILE_WIDTH,TILE_WIDTH,1);

TILEWIDTH is 32. row=565, col=584. Parallel mean displayed as image enter image description here

No, I mean what `is` the magnitude of the differences you are seeing. — chappjc, Feb 06 '15 at 08:44
What do you mean with serial and parallel outputs? How do you view d_mean and d_std? In what float ranges are the incoming values? I'm not sure about HSI, is that float, and all values in [0,1]? Are you passing in the 3 channels H, S and I? Does that make sense? You could try do debug your code by passing in an greyscaled image, i.e. just one channel, maybe as a float in [0,1]. — Albert, Feb 06 '15 at 08:46
My image is of size 565x584. Many values which are in range of 0.0982-0.0911(serial) are being calculated in the range of 0.3259-0.0.2014(parallel). the ranges are approximate. — Pawan, Feb 06 '15 at 08:52
HSI is the name of the image. I am using only intentsities in MATLAB and sending only intensities to mex. so array is 2d . — Pawan, Feb 06 '15 at 08:54
you can calc the local mean faster with `conv2(I,ones(9)./9^2,'same')` — bla, Feb 08 '15 at 20:20
what do you mean by "no upto the mark"?. using `conv2` with a normalized `ones` filter is exactly taking the local mean... — bla, Feb 09 '15 at 21:45
Sorry, in work pressure, I missed out the 'same' inside conv2. can you tell me something for finding standard deviation? If you see my CUDA code I am finding `std` too. If some CUDA answer helped me about indexing that would have helped me in finding std too. As you solved my prblm with MATLAB itself, I am asking for std....Thankyou for the answer. — Pawan, Feb 10 '15 at 04:25
I believe I've fixed your indexing question, please look at the answer below. If you are still having difficulty with the standard deviation let me know. Otherwise please let me know what has not been answered. — Christian Sarofeen, Feb 12 '15 at 00:07

Christian Sarofeen · Answer 1 · 2015-02-10T21:22:32.793

It is important to note Matlab's c api is column-major ordered, however as mentioned in the comments OP has made sure of the consistency. The problem is that the stride used to access the data did not include the pads of the image. Going from one row to another requires a stride of col+8 (8 being padding of 4 on each side.

changing

__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;

if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
    for(i=Y-4;i<=Y+4;i++){
        for(j=X-4;j<=X+4;j++){
            sum= sum + d_HSIImage2[i*col+j];
        }
    }

    d_mean[(Y-4)*col+X-4] = sum/81;
    double mean = sum/81;
    sum = 0;

    for(i=Y-4;i<=Y+4;i++){
        for(j=X-4;j<=X+4;j++){
            int index = i*col+j;
            sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
        }
    }
    d_std[(Y-4)*col+X-4] = sqrt(sum/81);
}

to

__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;

if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
    for(i=Y-4;i<=Y+4;i++){
        for(j=X-4;j<=X+4;j++){
            sum= sum + d_HSIImage2[i*(col+8)+j];
        }
    }

    d_mean[(Y-4)*col+X-4] = sum/81;
    double mean = sum/81;
    sum = 0;

    for(i=Y-4;i<=Y+4;i++){
        for(j=X-4;j<=X+4;j++){
            int index = i*(col+8)+j;
            sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
        }
    }
    d_std[(Y-4)*col+X-4] = sqrt(sum/81);
}

Should work, however, I have included a compilable example that I validated on a small sample, that should be easy to expand.

It is not optimized, but that wasn't part of your question. Optimization using shared memory would give a large performance boost.

#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#include <cuda.h>

using namespace std;

__global__ void phi(double *img, int row, int col, double *d_mean){

  int X=blockDim.x*blockIdx.x+threadIdx.x+4;
  int Y=blockDim.y*blockIdx.y+threadIdx.y+4;
  double sum = 0;

  if(Y<row+4 && X<col+4){


    for(int i=-4; i<=4; ++i){
      for(int j=-4; j<=4; ++j){
        sum+=img[ (Y+j)*(col+8)+X+i];
      }
    }

    sum/=81;

    d_mean[(Y-4)*col+X-4]=sum;

  }

}

int main(int argc, char * argv[]) {

  int width=10, height=10;
  double *h_img=new double[(width+8)*(height+8)];

  for(int i=0; i<height+8; i++){
    for(int j=0; j<width+8; j++){
      h_img[i*(width+8)+j]=0.0;
    }
  }

  for(int i=0; i<height; i++){
    for(int j=0; j<width; j++){
      int index = (i+4)*(width+8)+j+4;
      h_img[index]=i*width+j;
    }
  }

  for(int i=0; i<height+8; i++){
    for(int j=0; j<width+8; j++){
      cout<<h_img[i*(width+8)+j]<<" ";
    }cout<<endl;
  }

  double *d_img;
  size_t size=sizeof(double)*(height+8)*(width*8);
  cudaMalloc(&d_img, size);
  cudaMemcpy(d_img, h_img, size, cudaMemcpyHostToDevice);

  size = sizeof(double)*height*width;
  double *d_avg;
  cudaMalloc(&d_avg, size);

  dim3 block(32, 32, 1);
  dim3 grid(width/32+1, height/32+1, 1);
  phi<<<grid, block>>>(d_img, height, width, d_avg);
  cudaDeviceSynchronize();
  double *h_avg=new double[width*height];
  cudaMemcpy(h_avg, d_avg, size, cudaMemcpyDeviceToHost);

  for(int i=0; i<height; i++){
    for(int j=0; j<width; j++){
      cout<<h_avg[i*width+j]<<" ";
    }cout<<endl;
  }

   return 0;
}

Thanks for replying . This is the formula for padding and I have already kept it in the question. `HSIImage2=padarray(HSIImage,[4,4],'symmetric');` If you observe this line,`[phi_x2,mean2,sd2] = getPhi(HSIImage(:,:,3)',HSIImage2(:,:,3)',m,n);` it is clear that I am sending the transpose of the image. — Pawan, Feb 09 '15 at 16:13
sum= sum + d_HSIImage2[i*col+j]; col is not the correct stride, it should be col+2*PAD (col+8) — Christian Sarofeen, Feb 09 '15 at 20:57
See edits in answer for a functioning example on a 10x10 zero-padded with size 4 on each size — Christian Sarofeen, Feb 09 '15 at 21:26
I am updating my new CUDA code which is finding the mean exactly same as MATLAB. But a problem with standard deviation. Please have a look. May be final changes....Thankyou — Pawan, Feb 13 '15 at 07:48

score 0 · Accepted Answer · edited May 23 '17 at 11:48

0

Here's my 2 cents regarding local mean and local std. You should check whether using matlab's optimized built-in functions (conv2 and stdfilt , with their gpu support) gives you better performance than a "simple" mex version. For example, to take the local mean, the fastest will be to use conv2 as follows:

local_mean_image=conv2(image,normalized_window,'same');

where in your case normalized_window=ones(9)./9^2;

For local std use stdfilt :

local_std_image = stdfilt(image, ones(9));

Both options are available for faster GPU performance, I use conv2 with Jacket routinely, and I saw the stdfilt supports gpuarray variables.

edited May 23 '17 at 11:48

Community

1
1

answered Feb 12 '15 at 01:52

bla

25,846
10
70
101

Still problem with std deviation. _[link]https://drive.google.com/file/d/0B_Op_NoIhRnbaXEtMm1BY3AzbWc/view?usp=sharing_. This is the difference I get when I compare `stdfilt(...)` and `std(std(...))`. And sadly it is affecting my final output. – Pawan Feb 12 '15 at 05:17
std(std(... is not what you want though. You are calculating the wrong thing! think that you want to take the std of the 9x9 block around a pixel, this means you take the std of 81 values and put it where the pixel is. However when you take the std of 9 vectors you get 9 values. when you take again the std of these 9 values you get a different value, you erased the statistical sampling ! try in matlab `std(std(a))` vs `std(a(:))` for any matrix `a`. (for example `a=reshape(1:81,9,[])`) – bla Feb 12 '15 at 05:45
I understood the differences. Thanks for that. But `std(std(..))` is giving me best final output I checked it twice before replying. Is there any way to get same output as `std(std(..))` using `stdfilt(...)`? – Pawan Feb 12 '15 at 06:13
no, but you can use `im2col` to perfrom the `std(std(` without for looping. Why are you using std(std( to begin with? please understand that by doing so you are *not* finding the local standard deviation at every pixel. – bla Feb 12 '15 at 06:52
For each pixel x (i,j) in the image, a M x M window W with x as the centre, I want standard deviation of the intensities within the window. – Pawan Feb 12 '15 at 10:39
1

exactly, let me try to explain you again why your are doing it wrong. So lets say W=9x9 window, by doing `std(std(x(i-4:i+4,j-4:j+4))`, you first calculate the `v=[ std(x(:,j-4)) std(x(:,j-3)) ... std(x(:,j+4))]` this leaves you with a vector of 9 values, each has the std of a column of the block. Then you `std(v)` that gives you a single value, but of the output of `v` not the original 81 intensities around x(i,j). Try for example `a=reshape(1:81,9,[])` this is a 9x9 matrix that has intensities from 1 to 81, so for a(5,5) this should emulate the 9x9 window. see how `std(a(:))` is different – bla Feb 12 '15 at 18:18
1

than `std(std(a))`. If you beak it down to parts you first look on the inner parenthesis: `std(a)` this is equivalent to `[std(a(:,1) std(a:,2)... std(a:,9)]`, in all cases the result will be `2.7386` because in each column there numbers range by only 9 (10 to 18, 64 to 72 etc). so another std on that `std([2.7386 2.7386 2.7386... 2.7386]) is zero. However if instead you take all the intensities in `a` by flattening in to a vector `a(:)` then `std(a(:))=23.5266` – bla Feb 12 '15 at 18:24
My last question. Can you show me how to use `im2col` as std(std(..)) – Pawan Feb 13 '15 at 07:32
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/70833/discussion-between-saipavan579-and-bla). – Pawan Feb 13 '15 at 07:45

score 0 · Answer 3 · edited Feb 13 '15 at 10:47

0

By observing the answers of @Christian Sarofeen and of @bla, I made some changes to my code and now I am able to find the mean exactly same as MATLAB. I posting this thinking that some one may use it in future(I am sending the image as is from MATLAB). Still finding standard deviation is little problem.

__global__ void phi(double *d_HSIImage,double *d_HSIImage2, int row, int col, double *d_phi_x, double *d_mean, double *d_std)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
int i,j;
double sum = 0;

if(Y>3 && X>3 && Y<row+4 && X<col+4)
{
    int index = (X-4)*row+Y-4;

    for(i=-4;i<=4;i++){
        for(j=-4;j<=4;j++){
            sum= sum + d_HSIImage2[(X+j)*(row+8)+(Y+i)];
        }
    }

    d_mean[index] = sum/81;
    double mean = 0;
    double temp_std[9] = {0} ;

    for(j=-4;j<=4;j++){
       sum = 0;
       for(i=-4;i<=4;i++){
           sum = sum + d_HSIImage2[(X+j)*(row+8)+(Y+i)];//vector mean
       }
       mean = sum/9;
       sum =0 ;
       for(i=-4;i<=4;i++){
          int index = (X+j)*(row+8)+(Y+i);
          sum= sum + (d_HSIImage2[index]-mean) * (d_HSIImage2[index]-mean);
        }
        temp_std[j+4] = (sqrt(sum/9));//vector std
    }
    sum =0 ;
    for(j=-4;j<=4;j++){
        sum = sum + temp_std[j+4];//mean of vectors
    }

    mean = sum/9;
    sum = 0 ;

    for(j=-4;j<=4;j++){
        sum = sum + (temp_std[j+4]-mean) * (temp_std[j+4]-mean);
    }

    d_std[index] = sqrt(sum);//std of vectors

    d_phi_x[index] = 1.0/(1.0+exp((d_mean[index]-d_HSIImage[index])/d_std[index]));
}
}

edited Feb 13 '15 at 10:47

gboffi

22,939
8
54
85

answered Feb 13 '15 at 07:57

Pawan

423
6
28

The problem is the mean that you're using. You are not finding the true standard deviation, but the standard deviation of the standard deviation of the columns. Therefore you need to calculate the average of the columns, use that for the standard deviation of each column. Then you need the average of the standard deviations, then the standard deviation of the standard deviations. As has been mentioned before, this is not really the standard deviation of the window. To calculate the real std of the window you could have done: submtrx=HSIImage2(i-4:i+4, j-4:j+4, 3); sd(i-4,j-4)=std(submtrx(:)); – Christian Sarofeen Feb 13 '15 at 19:23
I want that only, std(std(...)). That is giving me right answer finally. In the code above I am calculating mean of each column and then std of all columns. Later mean of all stds and std of that means. – Pawan Feb 14 '15 at 03:58
The right answer compared to what? What method are you actually trying to follow? – Christian Sarofeen Feb 14 '15 at 17:23
Right answer in the sense when I use result of std(std(..)) in my project. Final output of my project is depended on that std(std(...)) – Pawan Feb 15 '15 at 04:05
As you solved my problem in CUDA, I am going to give this bounty to you. Even answer of @Bla helped me, but to learn something about MATLAB. – Pawan Feb 15 '15 at 04:09
@Christian Please look at this. [link]http://stackoverflow.com/questions/28522947/del2-of-matlab-in-cuda – Pawan Feb 15 '15 at 04:21

finding local mean in an image using mex-cuda

3 Answers3