I am reprogramming a piece of MATLAB code in mex (using C). So far my C version of the MATLAB code is about as double as fast as the MATLAB code. Now I have three questions, all related to the code below:
- How can I speed up this code more?
- Do you see any problems with this code? I ask this because I don't know mex very well and I am also not a C guru ;-) ... I am aware that there should be some checks in the code (for example if there is still heap space while using
realloc
, but I left this away for the sake of simplicity for the moment) - Is it possible, that MATLAB is optimizing so well, that I really can't get much more than twice as fast code in C...?
The code should be more or less platform independent (Win, Linux, Unix, Mac, different Hardware), so I don't want to use assembler or specific linear Algebra Libraries. So that's why I programmed the staff by myself...
#include <mex.h>
#include <math.h>
#include <matrix.h>
void mexFunction(
int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double epsilon = ((double)(mxGetScalar(prhs[0])));
int strengthDim = ((int)(mxGetScalar(prhs[1])));
int lenPartMat = ((int)(mxGetScalar(prhs[2])));
int numParts = ((int)(mxGetScalar(prhs[3])));
double *partMat = mxGetPr(prhs[4]);
const mxArray* verletListCells = prhs[5];
mxArray *verletList;
double *pseSum = (double *) malloc(numParts * sizeof(double));
for(int i = 0; i < numParts; i++) pseSum[i] = 0.0;
float *tempVar = NULL;
for(int i = 0; i < numParts; i++)
{
verletList = mxGetCell(verletListCells,i);
int numberVerlet = mxGetM(verletList);
tempVar = (float *) realloc(tempVar, numberVerlet * sizeof(float) * 2);
for(int a = 0; a < numberVerlet; a++)
{
tempVar[a*2] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1] - partMat[i];
tempVar[a*2 + 1] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + lenPartMat] - partMat[i + lenPartMat];
tempVar[a*2] = pow(tempVar[a*2],2);
tempVar[a*2 + 1] = pow(tempVar[a*2 + 1],2);
tempVar[a*2] = tempVar[a*2] + tempVar[a*2 + 1];
tempVar[a*2] = sqrt(tempVar[a*2]);
tempVar[a*2] = 4.0/(pow(epsilon,2) * M_PI) * exp(-(pow((tempVar[a*2]/epsilon),2)));
pseSum[i] = pseSum[i] + ((partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + 2*lenPartMat] - partMat[i + (2 * lenPartMat)]) * tempVar[a*2]);
}
}
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
for(int a = 0; a < numParts; a++)
{
*(mxGetPr(plhs[0]) + a) = pseSum[a];
}
free(tempVar);
free(pseSum);
}
So this is the improved version, which is about 12 times faster than MATLAB version. The conversion thing is still eating up much time, but I let this away for now, becaues I have to change something in MATLAB for this. So first focus on the remaining C code. Do you see any more potential in the following code?
#include <mex.h>
#include <math.h>
#include <matrix.h>
void mexFunction(
int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double epsilon = ((double)(mxGetScalar(prhs[0])));
int strengthDim = ((int)(mxGetScalar(prhs[1])));
int lenPartMat = ((int)(mxGetScalar(prhs[2])));
double *partMat = mxGetPr(prhs[3]);
const mxArray* verletListCells = prhs[4];
int numParts = mxGetM(verletListCells);
mxArray *verletList;
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
double *pseSum = mxGetPr(plhs[0]);
double epsilonSquared = epsilon*epsilon;
double preConst = 4.0/((epsilonSquared) * M_PI);
int numberVerlet = 0;
double tempVar[2];
for(int i = 0; i < numParts; i++)
{
verletList = mxGetCell(verletListCells,i);
double *verletListPtr = mxGetPr(verletList);
numberVerlet = mxGetM(verletList);
for(int a = 0; a < numberVerlet; a++)
{
int adress = ((int) (*(verletListPtr + a))) - 1;
tempVar[0] = partMat[adress] - partMat[i];
tempVar[1] = partMat[adress + lenPartMat] - partMat[i + lenPartMat];
tempVar[0] = tempVar[0]*tempVar[0] + tempVar[1]*tempVar[1];
tempVar[0] = preConst * exp(-(tempVar[0]/epsilonSquared));
pseSum[i] += ((partMat[adress + 2*lenPartMat] - partMat[i + (2*lenPartMat)]* tempVar[0]);
}
}
}