Matlab taking too much time in computing mex function for large array

Question

I wrote a MATLAB script in which I am passing few scalars and one row vector as input arguments of a mex function and after doing some calculation, it is returning a scalar as output. This process has to be done for all the elements of an array whose size is 1 X 1638400. Below is the corresponding code:

ans=0;
for i=0:1638400-1
    temp = sub_imed(r,i,diff);
    ans  = ans + temp*diff(i+1); 
end

where r,i are scalars, diff is a vector of size 1 X 1638400 and sub_imed is a MEX function which does the below job:

void sub_imed(double r,mwSize base, double* diff, mwSize dim, double* ans)              
{                                                                                           
     mwSize i,k,l,k1,l1;
     double d,g,temp;

     for(i=0; i<dim; i++)
     {   
          k = (base/200) + 1;
          l = (base%200) + 1;
          k1 = (i/200) + 1;
          l1 = (i%200) + 1;

          d = sqrt(pow((k-k1),2) + pow((l-l1),2));

          g=(1/(2*pi*pow(r,2)))*exp(-(pow(d,2))/(2*(pow(r,2))));   

          temp = temp + diff[i]*g;
     }

     *ans  = temp;
}

void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[]) 
{
    double *diff;           /* Input data vectors */
    double r;               /* Value of r (input) */
    double* ans;            /* Output ImED distance */
    size_t base,ncols;      /* For storing the size of input vector and base */

    /* Checking for proper number of arguments */
    if(nrhs!=3) 
       mexErrMsgTxt("Error..Three inputs required.");

    if(nlhs!=1) 
       mexErrMsgTxt("Error..Only one output required.");

    /* make sure the first input argument(value of r) is scalar */
    if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxGetNumberOfElements(prhs[0])!=1) 
       mexErrMsgTxt("Error..Value of r must be a scalar."); 

    /* make sure that the input value of base is a scalar */
    if( !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]) || mxGetNumberOfElements(prhs[1])!=1) 
       mexErrMsgTxt("Error..Value of base must be a scalar."); 

    /* make sure that the input vector diff is of type double */
    if(!mxIsDouble(prhs[2]) || mxIsComplex(prhs[2]))    
       mexErrMsgTxt("Error..Input vector must be of type double.");

    /* check that number of rows in input arguments is 1 */
    if(mxGetM(prhs[2])!=1) 
       mexErrMsgTxt("Error..Inputs must be row vectors."); 

    /* Get the value of r */
    r = mxGetScalar(prhs[0]);
    base = mxGetScalar(prhs[1]);

    /* Getting the input vectors */
    diff = mxGetPr(prhs[2]);
    ncols = mxGetN(prhs[2]);

    /* Creating link for the scalar output */
    plhs[0] = mxCreateDoubleMatrix(1,1,mxREAL);
    ans = mxGetPr(plhs[0]); 

    sub_imed(r,base,diff,(mwSize)ncols,ans);
}

For more details about the problem and the underlining algorithm please follow the thread Euclidean distance between images.

I did a profiling of my MATLAB script and got to know that it is taking 63 sec. just for 387 calls to sub_imed() mex function. So for 1638400 calls to sub_imed, ideally it will take around 74 hours which is just too long.

Can someone please help me to optimize the code by suggesting some alternative ways to reduce the time taken for computation.

Thanks in advance.

Did you write this mex function? Why do you use mex and not MATLAB for this? — hbaderts, May 02 '16 at 07:43
Yeah..Actually without mex it was taking even longer..so I thought it will be a good idea to write a mex function for inner loop and it may reduce my running cost. — nagarwal, May 02 '16 at 08:12
Several of the computations: eg. `l = (base%200) + 1;` or `(1/(2*pi*pow(r,2)))/(2*pow(r,2))` can be moved outside the for loop... — Matthew Gunn, May 02 '16 at 08:48
Use `mxCreateDoubleScalar` for the answer, use `const double *` for `diff`, you should initialise `temp` to `0.0` before you start your for loop, and you should replace all calls to `pow(x,2)` by `x*x`. Then, turn on optimisation flags (`-O3` if possible), and let us know about the new timing. — Jonathan H, May 02 '16 at 10:17
@Sh3lijohn I don't think `mxCreateDoubleScalar` will work in my case because I need to pass a pointer to the ans in sub_imed function call. Only then it will be able to reflect the changes made into it by sub_imed function. I checked it with the changes: `double ans;` plhs[0] = mxCreateDoubleScalar(ans)` and `sub_imed(r,base,diff,(mwSize)ncols,&ans); ` but it is producing some garbage values. — nagarwal, May 02 '16 at 11:44
using mex rather than matlab is unlikely to ever help by more than a factor of about 10 (I'm guessing, but it's probably not far off the mark, unless you started with particularly horrendous code in Matlab). So if you know in advance that you need far more than 10x speedup, then you need to try something more interesting than a direct translation to mex. — dan-man, May 02 '16 at 13:49
@nagarwal And the rest? `mxCreateDoubleScalar` is what you should use given what you're doing; of course you can also do it with a 1x1 array, but you shouldn't blame the language for a lack of results when it's your ignorance of that same language which is to blame. Learn about pointers and references if you need. The "garbage values" you mention should be what you should get, given that you don't initialise `temp` correctly. — Jonathan H, May 02 '16 at 16:15

score 3 · Accepted Answer · answered May 02 '16 at 09:18

3

I ported your code back to MATLAB and made some small adjustements, while the results should stay the same. I introduced the following constants:

N = 8192;
step = 0.005;

Note that N / step = 1638400. With that, you can rewrite your variable k (and rename it to baseDiv):

baseDiv = 1 + (0 : step : (N-step)).';

i.e. it is 1:8193 in steps of 0.005. Similarly, l is 1:200 (=1:(1/0.005)), repeated 8192 times in a row, which is (now called baseMod):

baseMod = (repmat(1:1:(1/step), 1, N)).';

Your variables k1 and l1 are simply the ith element of k and l, i.e. baseDiv(i) and baseMod(i).

With the vectors baseDiv, and baseMod, one can calculate d, g and your temporary variable tmp with

d = sqrt((baseDiv(k)-baseDiv).^2 + (baseMod(k)-baseMod).^2);
g = 1/(2*pi*r^2) * exp(-(d.^2) / (2*r^2));
tmp = sum(diffVec .* g);

We can put this into your MATLAB for loop, so the whole program becomes

% Constants
N = 8192;
step = 0.005;

% Some example data
r = 2;
diffVec = rand(N/step,1);

base = (0:(numel(diffVec)-1)).';    
baseDiv = (1:step:1+N-step).';
baseMod = (repmat(1:1:(1/step), 1, N)).';

res = 0;
for k=1:(N/step)
    d = sqrt((baseDiv(k)-baseDiv).^2 + (baseMod(k)-baseMod).^2);
    g = 1/(2*pi*r^2) * exp(-(d.^2) / (2*r^2));
    tmp = sum(diffVec .* g);
    res = res + tmp * diffVec(k);
end

By eliminating the inner for loop and calculating it in a vectorized fashion, you still need 11 sec for 1000 iterations, resulting in a total runtime of 5 hours. Still - a speed-up of more than 10x. To get an even higher speed-up, you have two possibilities:

1) Complete vectorization: You can easily vectorize the remaining for-loop by using bsxfun(@minus, baseDiv, baseDiv.') and suming over the columns to calculate all values at the same time. Unfortunately we have a small problem: a 1638400-by-1638400 double matrix would take up 20TB of RAM, which - I assume - you don't have in your Laptop ;-)

2) Less samples: You are doing some mathematical transform with a resolution of step=0.005. Check if you really, really need this precision! If you take 1/10 of the precision: step=0.05, you are 100 times faster, and are finished within 3 minutes!

answered May 02 '16 at 09:18

hbaderts

14,136
4
41
48

`sum(a.*b)` is usually expressible as a dot-product, i.e. `a*b.'`, this should be faster to compute due to one pass in memory and fused-multiply-add. (actually in Matlab doing a transpose, even of a 1d array, may require doing a full pass in memory, I'm not sure). – dan-man May 02 '16 at 11:01
Sir, thank you very much for an elaborate and substantial reply. It really helped me a lot but I have few things to say about your approach. Actually, I have two images of size 8192 x 200 and I want to calculate a matrix (1638400 x 1638400) containing Euclidean distance between between each and every pixel. Thus, in the variables k & L, I was calculating row & column of 1st image and in k1 & l1, row and column of 2nd image but here baseDiv will contain some floating point data. Anyways I can correct that for me. – nagarwal May 02 '16 at 11:03
@nagarwal - I suspect there is a better way of solving your actual problem, you might want to ask it as a new question, referencing this one. – dan-man May 02 '16 at 11:07
For your last two points, yeah..creating a matrix of that bigger size is not possible in my pc and in fact this was the reason that made me to so all those computations in loop. Earlier I was also trying to do vectorize everything so that it could be just multiplication of some vectors and matrices but I didn't think about partial vectorization. Secondly, since it is a kind of calculating Euclidean distance between images therefore it is necessary for me to consider 200 columns and 8192 rows, not less than that. – nagarwal May 02 '16 at 11:10
@dan-man Yeah..I think dot product will be better to replace with. – nagarwal May 02 '16 at 11:14
Okay..Let's see..I will post that as a new one. – nagarwal May 02 '16 at 11:19
@dan-man: I think you know what I want to do as I have cleared it in the first comment. First of all, it is calculating dist between any 2 pixels in d and then just some aftermaths over it and storing it in g. Finally my Euclidean Distance between the two images will be `diff * G * diff ' ` where `size(diff) = 1x1638400` `size(G) = 1638400 x 1638400` `size(diff ') = 1638400 x 1`. `diff` vector is actually the difference between two images by reshaping them as row vector by the formula `Ip=reshape(permute(gray1,[2,1]),1,1638400);` please post it as an answer if you find any better solution. – nagarwal May 02 '16 at 12:51
I think you could do a better job of explaining this, and it definitely deserves being a separate question. Include a decent diagram, like [this](http://imgur.com/ptPBqKO), which is something I just drew. – dan-man May 02 '16 at 13:22
hmm, it seems like you are just doing a convolution, so it could be made really fast..hence the need to describe your problem really clearly – dan-man May 02 '16 at 13:40
I might have a solution that takes less than 25s for your full sized data, and consists of just 4 lines, but I will only post it if you ask a new question! – dan-man May 02 '16 at 14:04
good! using fft-based convolution my version should now take less than 2s. – dan-man May 02 '16 at 14:32
1

http://stackoverflow.com/questions/36985718/euclidean-distance-between-images Please follow the new question – nagarwal May 02 '16 at 15:09

Jonathan H · Answer 2 · 2016-05-03T20:02:47.523

Did you compile with the optimisation flag? (See flag -O)
Instead of using pow(x,2) in C/C++ you should write x*x
I am almost certain that the reason why it is slow is not because of the code you've shown, but because of what you do in the mexFunction. If I had to guess I'd say you're pointlessly copying the memory in diff, but we need to see the entire Mex function to be sure.

Try the following C++ code using mex -O myfile.cpp:

void sub_imed( double r, size_t base, const double *diff, size_t dim, double& ans)
{
    double d, g;

    // these need to be double to avoid underflow
    double k = base / 200;
    double l = base % 200;

    r = 2*r*r;
    for(; dim; --dim, ++diff )
    {
        d = k - i/200;
        g = l - i%200;

        ans += (*diff) * exp( - (d*d + g*g)/r ) / (pi*r);
    }
}

void mexFunction(int nlhs,mxArray *plhs[],int nrhs,const mxArray *prhs[])
{
    /* Checking for proper number of arguments */
    if(nrhs!=3)
        mexErrMsgTxt("Error..Three inputs required.");

    if(nlhs!=1)
        mexErrMsgTxt("Error..Only one output required.");

    /* make sure the first input argument(value of r) is scalar */
    if( !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxGetNumberOfElements(prhs[0])!=1 )
        mexErrMsgTxt("Error..Value of r must be a scalar.");

    /* make sure that the input value of base is a scalar */
    if( !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]) || mxGetNumberOfElements(prhs[1])!=1 )
        mexErrMsgTxt("Error..Value of base must be a scalar.");

    /* make sure that the input vector diff is of type double */
    if( !mxIsDouble(prhs[2]) || mxIsComplex(prhs[2]) )
        mexErrMsgTxt("Error..Input vector must be of type double.");

    /* check that number of rows in input arguments is 1 */
    if( mxGetM(prhs[2])!=1 )
        mexErrMsgTxt("Error..Inputs must be row vectors.");

    /* Get the value of r */
    double r    = mxGetScalar(prhs[0]);
    size_t base = static_cast<size_t>(mxGetScalar(prhs[1]);

    /* Getting the input vectors */
    const double *diff = mxGetPr(prhs[2]);
    size_t nrows = static_cast<size_t>(mxGetN(prhs[2]));

    /* Creating link for the scalar output */
    plhs[0] = mxCreateDoubleScalar(0.0);
    sub_imed( r, base, diff, nrows, *mxGetPr(plhs[0]) );
}

I don't think copying from plhs to diff is pointless. Anyways I have edited my question and now you can see the complete mex function. — nagarwal, May 02 '16 at 09:22
I don't understand your comment, you're not copying it, and certainly not from `plhs` anyway. Use `mxCreateDoubleScalar` and you should use `const double*`, not `double*` for `diff`. — Jonathan H, May 02 '16 at 10:10
What's the updated timing with optimisations flag and without calls to `pow`? — Jonathan H, May 02 '16 at 10:14
@Sh3lijohn: Thanks for the code. I will implement it and will keep you posted regarding its performance. — nagarwal, May 03 '16 at 07:36
@Sheljohn: Sorry, I was a bit busy so couldn't follow up with you guys..Anyways in the function sub_imed, you calculated `d = k - i/200;` & `g = l - i%200;` but `i` is not declared anywhere. So, I think you mean it to be `d = k - dim/200;` & `g = l - dim%200;`..Right !! — nagarwal, May 04 '16 at 06:23
@Sheljohn: Unfortunately, there is no improvement using this approach. It is taking around 0.11 second for one iteration. Therefore, for 1638400 iterations it will take around 50 hours..!! — nagarwal, May 04 '16 at 06:45
I wouldn't call 24 hours "no improvement", but it is still impractical. :) Hope you get better luck with the Matlab version then! — Jonathan H, May 04 '16 at 14:20
Yeah..surely it was a good improvement. sorry for my words..Thank you very much for your code.. — nagarwal, May 04 '16 at 14:45

Matlab taking too much time in computing mex function for large array

2 Answers2

Linked