
Using the "random stop" method, I have determined that the following two lines appear to be very slow:

cv::Mat pixelSubMue = pixel - vecMatMue[kk_real];   // ca. 35.5 %
cv::Mat pixelTemp = pixelSubMue * covInvRef;        // ca. 58.1 %
cv::multiply(pixelSubMue, pixelTemp, pixelTemp);    // ca. 0 %
cv::Scalar sumScalar = cv::sum(pixelTemp);          // ca. 3.2 %

double cost = sumScalar.val[0] * 0.5 + vecLogTerm[kk_real]; // ca. 3.2 %
  • vecMatMue is a std::vector<cv::Mat> <- I know there is a lot of copying involved, but using pointers does not change performance much here
  • pixelSubMue is a cv::Mat(1, 3, CV_64FC1) row vector
  • covInvRef is a reference to a cv::Mat(3, 3, CV_64FC1) matrix
  • vecLogTerm is a std::vector<double>

The code snippet above is inside an inner loop that is executed millions of times.

Question: Is there a way to improve the speed of that operation?
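(For reference, unless I am misreading my own snippet: it evaluates cost = 0.5 * (p - mu)^T * Sigma^-1 * (p - mu) + logTerm, i.e. the quadratic form of a Gaussian log-likelihood, where p is the pixel, mu is vecMatMue[kk_real] and Sigma^-1 is covInvRef.)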

Edit: Thanks for the comments! I have now measured the time within the program, and the percentages above indicate how much of the total time is spent on each line. The measurements were done in Release mode. I took six measurements; each time, the code was executed millions of times.

I should probably also mention that the std::vector objects have no effect on performance; I verified this by replacing them with constant objects.

Edit 2: I have also implemented the algorithm using the C-Api. The relevant lines look like this now:

cvSub(pixel, vecPMatMue[kk], pixelSubMue);                   // ca. 24.4 %
cvMatMulAdd(pixelSubMue, vecPMatFCovInv[kk], 0, pixelTemp);  // ca. 39.0 %
cvMul(pixelSubMue, pixelTemp, pixelSubMue);                  // ca. 22.0 %
CvScalar sumScalar = cvSum(pixelSubMue);                     // ca. 14.6 %
cost = sumScalar.val[0] * 0.5 + vecFLogTerm[kk];             // ca. 0.0 %

For the same input data, the C++ implementation needs ca. 3100 ms while the C implementation needs only ca. 2050 ms (both measurements refer to the total time for executing the snippet millions of times). I still prefer the C++ implementation, though, since it is easier for me to read (other "ugly" changes were necessary to make the code work with the C API).
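One thing worth trying in the C++ version before giving up on it: hoist the temporaries out of the inner loop so OpenCV reuses the buffers instead of allocating fresh cv::Mat data on every iteration. A sketch only, assuming the OpenCV 2.x C++ API; gaussCost and the scratch parameters are made-up names, not from my real code:

#include <opencv2/core/core.hpp>

// The two temporaries are passed in from outside, so the hot loop
// performs no cv::Mat allocations at all.
double gaussCost(const cv::Mat& pixel,       // 1x3 CV_64FC1
                 const cv::Mat& mue,         // 1x3 CV_64FC1
                 const cv::Mat& covInv,      // 3x3 CV_64FC1
                 double logTerm,
                 cv::Mat& pixelSubMue,       // 1x3 scratch, allocated once
                 cv::Mat& pixelTemp)         // 1x3 scratch, allocated once
{
    cv::subtract(pixel, mue, pixelSubMue);              // pixel - mue
    cv::gemm(pixelSubMue, covInv, 1.0, cv::noArray(),
             0.0, pixelTemp);                           // (pixel - mue) * covInv
    cv::multiply(pixelSubMue, pixelTemp, pixelTemp);    // elementwise product
    return cv::sum(pixelTemp).val[0] * 0.5 + logTerm;
}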

Edit 3: I have rewritten the code without using any function calls for the actual calculations:

capacity_t mue0 = meanRef.at<double>(0, 0);
capacity_t mue1 = meanRef.at<double>(0, 1);
capacity_t mue2 = meanRef.at<double>(0, 2);

// covInvRef is symmetric, so only its upper triangle needs to be read
capacity_t sigma00 = covInvRef.at<double>(0, 0);
capacity_t sigma01 = covInvRef.at<double>(0, 1);
capacity_t sigma02 = covInvRef.at<double>(0, 2);
capacity_t sigma11 = covInvRef.at<double>(1, 1);
capacity_t sigma12 = covInvRef.at<double>(1, 2);
capacity_t sigma22 = covInvRef.at<double>(2, 2);

// pixel - mue (the mueX variables are reused as the difference vector)
mue0 = p0 - mue0; mue1 = p1 - mue1; mue2 = p2 - mue2;

// (pixel - mue) * covInv, written out using the symmetry of covInv
capacity_t pt0 = mue0 * sigma00 + mue1 * sigma01 + mue2 * sigma02;
capacity_t pt1 = mue0 * sigma01 + mue1 * sigma11 + mue2 * sigma12;
capacity_t pt2 = mue0 * sigma02 + mue1 * sigma12 + mue2 * sigma22;

// elementwise product; the sum of the three terms is the quadratic form
mue0 *= pt0; mue1 *= pt1; mue2 *= pt2;

capacity_t cost = (mue0 + mue1 + mue2) / 2.0 + vecLogTerm[kk_real];

Now running the calculation over all pixels needs only ca. 150 ms!
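To keep the call site readable, the unrolled arithmetic can be wrapped in a small inline helper. A sketch, assuming capacity_t is double; the helper name and the packed upper-triangle layout sigma[6] = {s00, s01, s02, s11, s12, s22} are my own invention:

inline double gaussCostScalar(double p0, double p1, double p2,
                              const double mue[3], const double sigma[6],
                              double logTerm)
{
    const double d0 = p0 - mue[0], d1 = p1 - mue[1], d2 = p2 - mue[2];
    const double t0 = d0 * sigma[0] + d1 * sigma[1] + d2 * sigma[2];
    const double t1 = d0 * sigma[1] + d1 * sigma[3] + d2 * sigma[4];
    const double t2 = d0 * sigma[2] + d1 * sigma[4] + d2 * sigma[5];
    return (d0 * t0 + d1 * t1 + d2 * t2) / 2.0 + logTerm;
}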

bjoernz
    You're sure you are compiling in Release mode and you're using the Release mode OpenCV DLLs and LIBs? – Jacob Aug 08 '11 at 15:01
  • I compile in debug mode... I don't really know how to profile the code in Release mode. – bjoernz Aug 08 '11 at 16:50
  • Never profile code in debug mode because it is /always/ slower and may have different performance characteristics. Profile in Release and compile with SSE enabled if your platform supports it. – Jasper Bekkers Aug 10 '11 at 10:25
    Thanks for the advice. The time measurements were done in Release mode. I have not played with compiler flags yet. – bjoernz Aug 10 '11 at 10:49

1 Answer


It looks like you're compiling in Debug mode, which probably explains the performance hit. You can profile your code using timing functions such as clock().

E.g.

#include <ctime>
#include <iostream>

clock_t start, end;
...
start = clock();
cv::Mat pixelTemp = pixelSubMue * covInvRef;    // Very SLOW!
end = clock();

std::cout << "Elapsed time in seconds: "
          << static_cast<double>(end - start) / CLOCKS_PER_SEC << std::endl;
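A rough alternative sketch, if a C++11 compiler is available: std::chrono::steady_clock is monotonic and usually has much finer resolution than clock().

#include <chrono>
#include <iostream>

auto start = std::chrono::steady_clock::now();
// ... code to measure ...
auto end = std::chrono::steady_clock::now();

std::cout << "Elapsed time in ms: "
          << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
          << std::endl;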
Jacob
    Do not use `clock()`. Instead, use a platform-specific high resolution timer like the High Performance Timer. – Puppy Aug 08 '11 at 17:13
    @DeadMG: In a professional setup you may have a high resolution timer available. Otherwise, for examples and for getting a Good Enough(TM) rough idea, one may just place the relevant code in a loop and time it with `clock`. Adjust the loop as necessary, and run at least three times to see variability due to system load (the latter applies on Windows, where `clock` measures wall-clock time). – Cheers and hth. - Alf Aug 08 '11 at 22:05
    Thanks for your suggestion. I have now measured the times in various configurations. Any idea how to improve the speed of the vector-matrix-multiplication? – bjoernz Aug 09 '11 at 14:02
  • @bjoernz: Also, are you using Release mode OpenCV libraries? E.g. `cxcore200d.lib` is the Debug mode library (note the `d` at the end) and the corresponding release mode library is `cxcore200.lib` – Jacob Aug 09 '11 at 14:27
  • Also, what is the size of `pixelSubMue` and `covInvRef`? – Jacob Aug 09 '11 at 14:32
  • I use the stock libs on Ubuntu (not debug builds). The size of `pixelSubMue` and `covInvRef` are 1x3 and 3x3 respectively. – bjoernz Aug 09 '11 at 15:13
  • Wait, are you saying a single 1x3 by 3x3 multiplication takes 3100 ms on average? – Jacob Aug 09 '11 at 15:18
  • No, it is part of an image segmentation algorithm. 3100 ms refers to the execution time of applying that code snippet five times to every pixel in the image. – bjoernz Aug 09 '11 at 17:01
  • The particular image that I use for performance tests is 1504x1000 pixel. – bjoernz Aug 10 '11 at 10:31