Using the "random stop" method (randomly pausing the program in a debugger and inspecting where it stopped) I have determined that the following two lines appear to be very slow:
cv::Mat pixelSubMue = pixel - vecMatMue[kk_real]; // ca. 35.5 %
cv::Mat pixelTemp = pixelSubMue * covInvRef; // ca. 58.1 %
cv::multiply(pixelSubMue, pixelTemp, pixelTemp); // ca. 0 %
cv::Scalar sumScalar = cv::sum(pixelTemp); // ca. 3.2 %
double cost = sumScalar.val[0] * 0.5 + vecLogTerm[kk_real]; // ca. 3.2 %
- vecMatMue is a std::vector<cv::Mat>, so vecMatMue[kk_real] is a cv::Mat <- I know there is a lot of copying involved, but using pointers does not change performance much here
- pixelSubMue is a cv::Mat(1, 3, CV_64FC1) row vector
- covInvRef is a reference to a cv::Mat(3, 3, CV_64FC1) matrix
- vecLogTerm is a std::vector<double>
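In other words, the snippet evaluates the quadratic form

cost = 0.5 * (pixel - mue) * covInv * (pixel - mue)^T + vecLogTerm[kk_real]

where (pixel - mue) is the 1x3 row vector pixelSubMue.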
The code snippet above is in an inner loop that is called millions of times.
Question: Is there a way to improve the speed of that operation?
Edit: Thanks for the comments! I have now measured the time within the program, and the percentages above indicate how much of the time is spent on each line. The measurements were done in Release mode. I did six measurement runs, and each time the code was executed millions of times.
I should probably also mention that the std::vector objects have no effect on the performance; I verified this by simply replacing them with constant objects.
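One avenue I have not benchmarked in the numbers above would be OpenCV's fixed-size cv::Matx types, which live on the stack and avoid the heap-allocated temporaries that every cv::Mat expression creates. A minimal sketch, assuming the means and inverse covariances were converted to cv::Matx13d / cv::Matx33d up front (vecMatMueX and vecCovInvX are hypothetical containers, p0/p1/p2 are the pixel channels):

// Sketch only; requires <opencv2/core.hpp>.
// vecMatMueX : hypothetical std::vector<cv::Matx13d> holding the means
// vecCovInvX : hypothetical std::vector<cv::Matx33d> holding the inverse covariances
cv::Matx13d pixelX(p0, p1, p2);                      // current pixel as a 1x3 row vector
cv::Matx13d diff = pixelX - vecMatMueX[kk_real];     // stays on the stack, no heap temporaries
double quad = diff.dot(diff * vecCovInvX[kk_real]);  // (pixel - mue) * covInv * (pixel - mue)^T
double cost = 0.5 * quad + vecLogTerm[kk_real];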
Edit 2: I have also implemented the algorithm using the C API. The relevant lines now look like this:
cvSub(pixel, vecPMatMue[kk], pixelSubMue); // ca. 24.4 %
cvMatMulAdd(pixelSubMue, vecPMatFCovInv[kk], 0, pixelTemp); // ca. 39.0 %
cvMul(pixelSubMue, pixelTemp, pixelSubMue); // ca. 22.0 %
CvScalar sumScalar = cvSum(pixelSubMue); // ca. 14.6 %
cost = sumScalar.val[0] * 0.5 + vecFLogTerm[kk]; // ca. 0.0 %
For the same input data, the C++ implementation needs ca. 3100 ms, while the C implementation needs only ca. 2050 ms (both measurements refer to the total time for executing the snippet millions of times). But I still prefer my C++ implementation, since it is easier for me to read (other "ugly" changes had to be made to make it work with the C API).
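For completeness, the timings above come from simple in-program measurement around the whole loop; a minimal sketch of the kind of harness used (numIterations and evaluateSnippet() are placeholders, not the real code):

// Sketch only: evaluateSnippet() stands in for the lines shown above.
int64 t0 = cv::getTickCount();
for (int i = 0; i < numIterations; ++i)
    evaluateSnippet();
double elapsedMs = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();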
Edit 3: I have rewritten the code without using any function calls for the actual calculations:
capacity_t mue0 = meanRef.at<double>(0, 0);
capacity_t mue1 = meanRef.at<double>(0, 1);
capacity_t mue2 = meanRef.at<double>(0, 2);
// only the upper triangle is read because covInvRef is symmetric
capacity_t sigma00 = covInvRef.at<double>(0, 0);
capacity_t sigma01 = covInvRef.at<double>(0, 1);
capacity_t sigma02 = covInvRef.at<double>(0, 2);
capacity_t sigma11 = covInvRef.at<double>(1, 1);
capacity_t sigma12 = covInvRef.at<double>(1, 2);
capacity_t sigma22 = covInvRef.at<double>(2, 2);
mue0 = p0 - mue0; mue1 = p1 - mue1; mue2 = p2 - mue2; // reuse mue* for the difference (pixel - mue)
capacity_t pt0 = mue0 * sigma00 + mue1 * sigma01 + mue2 * sigma02; // pt = (pixel - mue) * covInv
capacity_t pt1 = mue0 * sigma01 + mue1 * sigma11 + mue2 * sigma12;
capacity_t pt2 = mue0 * sigma02 + mue1 * sigma12 + mue2 * sigma22;
mue0 *= pt0; mue1 *= pt1; mue2 *= pt2; // element-wise product for the final dot product
capacity_t cost = (mue0 + mue1 + mue2) / 2.0 + vecLogTerm[kk_real];
Now the calculations for all pixels need only ca. 150 ms!
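For readability, the scalar version can also be packaged as a small inline helper; a sketch with my own naming (capacity_t is assumed to be a typedef for double):

// Sketch: the scalar calculation from Edit 3 wrapped in an inline helper.
// Requires <opencv2/core.hpp> for cv::Mat.
inline double gmmCost(double p0, double p1, double p2,
                      const cv::Mat& meanRef, const cv::Mat& covInvRef,
                      double logTerm)
{
    const double d0 = p0 - meanRef.at<double>(0, 0);
    const double d1 = p1 - meanRef.at<double>(0, 1);
    const double d2 = p2 - meanRef.at<double>(0, 2);

    // covInvRef is symmetric, so only six entries are read
    const double s00 = covInvRef.at<double>(0, 0), s01 = covInvRef.at<double>(0, 1);
    const double s02 = covInvRef.at<double>(0, 2), s11 = covInvRef.at<double>(1, 1);
    const double s12 = covInvRef.at<double>(1, 2), s22 = covInvRef.at<double>(2, 2);

    // quadratic form d * covInv * d^T
    const double q = d0 * (d0 * s00 + d1 * s01 + d2 * s02)
                   + d1 * (d0 * s01 + d1 * s11 + d2 * s12)
                   + d2 * (d0 * s02 + d1 * s12 + d2 * s22);

    return 0.5 * q + logTerm;
}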