1

I am reprogramming a piece of MATLAB code in mex (using C). So far my C version of the MATLAB code is about as double as fast as the MATLAB code. Now I have three questions, all related to the code below:

  1. How can I speed up this code more?
  2. Do you see any problems with this code? I ask this because I don't know mex very well and I am also not a C guru ;-) ... I am aware that there should be some checks in the code (for example if there is still heap space while using realloc, but I left this away for the sake of simplicity for the moment)
  3. Is it possible, that MATLAB is optimizing so well, that I really can't get much more than twice as fast code in C...?

The code should be more or less platform independent (Win, Linux, Unix, Mac, different Hardware), so I don't want to use assembler or specific linear Algebra Libraries. So that's why I programmed the staff by myself...

#include <mex.h>
#include <math.h>
#include <matrix.h>

void mexFunction(
    int nlhs, mxArray *plhs[],
    int nrhs, const mxArray *prhs[])
{
    double epsilon = ((double)(mxGetScalar(prhs[0])));
    int strengthDim = ((int)(mxGetScalar(prhs[1])));
    int lenPartMat = ((int)(mxGetScalar(prhs[2])));
    int numParts = ((int)(mxGetScalar(prhs[3])));
    double *partMat = mxGetPr(prhs[4]);
    const mxArray* verletListCells = prhs[5];
    mxArray *verletList;

    double *pseSum = (double *) malloc(numParts * sizeof(double));
    for(int i = 0; i < numParts; i++) pseSum[i] = 0.0;

    float *tempVar = NULL;

    for(int i = 0; i < numParts; i++)
    {
        verletList = mxGetCell(verletListCells,i);
        int numberVerlet = mxGetM(verletList);

        tempVar = (float *) realloc(tempVar, numberVerlet * sizeof(float) * 2);


        for(int a = 0; a < numberVerlet; a++)
        {
            tempVar[a*2] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1] - partMat[i];
            tempVar[a*2 + 1] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + lenPartMat] - partMat[i + lenPartMat];

            tempVar[a*2] = pow(tempVar[a*2],2);
            tempVar[a*2 + 1] = pow(tempVar[a*2 + 1],2);

            tempVar[a*2] = tempVar[a*2] + tempVar[a*2 + 1];
            tempVar[a*2] = sqrt(tempVar[a*2]);

            tempVar[a*2] = 4.0/(pow(epsilon,2) * M_PI) * exp(-(pow((tempVar[a*2]/epsilon),2)));
            pseSum[i] = pseSum[i] + ((partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + 2*lenPartMat] - partMat[i + (2 * lenPartMat)]) * tempVar[a*2]);
        }

    }

    plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
    for(int a = 0; a < numParts; a++)
    {
        *(mxGetPr(plhs[0]) + a) = pseSum[a];
    }

    free(tempVar);
    free(pseSum);
}

So this is the improved version, which is about 12 times faster than MATLAB version. The conversion thing is still eating up much time, but I let this away for now, becaues I have to change something in MATLAB for this. So first focus on the remaining C code. Do you see any more potential in the following code?

#include <mex.h>
#include <math.h>
#include <matrix.h>

void mexFunction(
    int nlhs, mxArray *plhs[],
    int nrhs, const mxArray *prhs[])
{
    double epsilon = ((double)(mxGetScalar(prhs[0])));
    int strengthDim = ((int)(mxGetScalar(prhs[1])));
    int lenPartMat = ((int)(mxGetScalar(prhs[2])));
    double *partMat = mxGetPr(prhs[3]);
    const mxArray* verletListCells = prhs[4];
    int numParts = mxGetM(verletListCells);
    mxArray *verletList;

    plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
    double *pseSum = mxGetPr(plhs[0]);

    double epsilonSquared = epsilon*epsilon;

    double preConst = 4.0/((epsilonSquared) * M_PI);

    int numberVerlet = 0;

    double tempVar[2];

    for(int i = 0; i < numParts; i++)
    {
        verletList = mxGetCell(verletListCells,i);
        double *verletListPtr = mxGetPr(verletList);
        numberVerlet = mxGetM(verletList);

        for(int a = 0; a < numberVerlet; a++)
        {
            int adress = ((int) (*(verletListPtr + a))) - 1;

            tempVar[0] = partMat[adress] - partMat[i];
            tempVar[1] = partMat[adress + lenPartMat] - partMat[i + lenPartMat];

            tempVar[0] = tempVar[0]*tempVar[0] + tempVar[1]*tempVar[1];

            tempVar[0] = preConst * exp(-(tempVar[0]/epsilonSquared));
            pseSum[i] += ((partMat[adress + 2*lenPartMat] - partMat[i + (2*lenPartMat)]* tempVar[0]);
        }

    }

}
Toribio
  • 3,963
  • 3
  • 34
  • 48
Michael
  • 706
  • 9
  • 29
  • Can you post the original Matlab code also. Often, the best optimisations for speed are performed at the algorithm design level. – learnvst Sep 05 '12 at 01:38

2 Answers2

2

Can you estimate ahead of time what will be the maximum size of tempVar and allocate memory for it before the loop instead of using realloc? Reallocating memory is a time consuming operation and if your numParts is large, this could have a huge impact. Take a look at this question.

Community
  • 1
  • 1
Malife
  • 303
  • 2
  • 6
2
  • You do not need to allocate the pseSum for local use and then later copy the data to the output. You can simply allocate a MATLAB object and get the pointer to the memory :

    plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
    pseSum  = mxGetPr(plhs[0]);
    

Thus you will not have to initialize pseSum to 0, because MATLAB already does it in mxCreateDoubleMatrix.

  • Remove all the mxGetPr from the inner loop and assign them to variables before.

  • Instead of casting doubles to ints consider using int32 or uint32 arrays in MATLAB. Casting double to int is expensive. The internal loop computations would look like

    tempVar[a*2] = partMat[somevar[a] - 1] - partMat[i];
    

    You use such constructs in your code

    ((int) (*(mxGetPr(verletList) + a)))
    

    You do it because the varletList is a 'double' array (that is the case by default in MATLAB), which holds integer values. Instead, you should use integer array. Before you call your mex file type in MATLAB:

    varletList = int32(varletList);
    

    Then you will not need the type cast to int above. You will simply write

    ((int*)mxGetData(verletList))[a]
    

    or better yet, assign earlier

    somevar = (int*)mxGetData(verletList);
    

    and later write

    somevar[a]
    
  • precompute 4.0/(pow(epsilon,2) * M_PI) before all loops! That is one expensive constant.

  • pow((tempVar[a*2]/epsilon),2)) is simply tempVar[a*2]^2/epsilon^2. You calculate sqrt(tempVar[a*2]) just before. Why do you square it now?

  • Generally do not use pow(x, 2). Just write x*x

  • I would add some sanity checks on the parameters, especially if you demand integers. Either use MATLABs int32/uint32 type, or check that what you get actually is an integer.

Edit in the new code

  • compute -1/epsilonSquared before the loops and compute exp(minvepssq*tempVar[0]).note that the result might differ slightly. Depends what you need, but if you don't care about exact order of operations, do it.

  • define a register variable preSum_r and use it to sum the results in the inner loop. After the loop assign it to preSum[i]. If you want more fun, you can write the result to the memory using SSE streaming store (_mm_stream_pd compiler intrinsic).

  • do remove double to int cast

  • most likely irrelevant, but try to change tempVar[0/1] to normal variables. Irrelevant, because the compiler should do that for you. But again, an array is not needed here.

  • parallelise the external loop with OpenMP. Trivial (at least the simplest version without thinking about data layout for NUMA architectures) since there is no dependence between the iterations.

angainor
  • 11,760
  • 2
  • 36
  • 56
  • 1
    Thanks guys for your comments!!! Some of them are really easy or just basic math. Should have seen that by myself :-( Now I got my code about 10 times faster than the Matlab code, which gets quite satisfying. I implemented all your hints, except the one from angainor about the casting of doubles. Can you tell me a bit more about that, don't really understand that... (Especially how to do). And why is x*x better than pow(x,2)? – Michael Sep 05 '12 at 12:55
  • You are right in that it might not matter - if the compiler is reasonably good it will find this optimization and instead of calling the pow function, which in the general case is more expensive, will simply execute X*x. But why risk it? The same with the expensive constant calculated inside the loop. The compiler might find it, but why to put it there in the first place? In the end you should just benchmark and check.. – angainor Sep 05 '12 at 14:15
  • Another question. Did some googling on that, but I am still not sure. Is there any faster exp() version in C??? Sure, this one is otimized to the max, but maybe there are some with a bit loss of precission but still good enough for my purpose... – Michael Sep 06 '12 at 17:14
  • I suggest posting a new question about the exp. Link this question there to show what type of code you have. – angainor Sep 06 '12 at 18:24
  • Hey, thanks for your suggestions. Using now openMP and this speeds the whole thing up!!! So this is now really fast enough! – Michael Sep 06 '12 at 20:24