(opencv rc1) What causes Mat multiplication to be 20x slower than per-pixel multiplication?

Question

// 700 ms
cv::Mat in(height,width,CV_8UC1);
in /= 4;

Replaced with

//40 ms
cv::Mat in(height,width,CV_8UC1);
for (int y=0; y < in.rows; ++y)
{
    unsigned char* ptr = in.data + y*in.step1();
    for (int x=0; x < in.cols; ++x)
    {
        ptr[x] /= 4;
    }
}

What can cause this behavior? Is it because OpenCV "promotes" Mat-by-Scalar multiplication to Mat-by-Mat multiplication, or is it a specific failed optimization for ARM? (NEON is enabled.)

Can you try `in *= 1.0f/4.0;`? Also, you didn't initialize the elements. – Micka May 11 '15 at 12:08
In my tests, the timings for floating-point multiplication are within about 20% of those for integer division, for both the per-pixel loop and the whole-Mat operation. – Boyko Perfanov May 11 '15 at 12:24
Can you run perf? https://perf.wiki.kernel.org/index.php/Main_Page – auselen May 11 '15 at 12:29
Maybe you can confirm it, but it looks like cv::Mat only has `operator /` for a double-precision scalar value. So your machine's double-precision division might be about 20 times slower than integer division? – Micka May 11 '15 at 12:46
Are you using the OpenCV debug or release libraries? Do you compile your code in debug or release mode, and with or without optimizations, forced single precision, etc.? What matrix sizes are we talking about? – Micka May 11 '15 at 15:08
Why don't you post the disassemblies? Without them, we can only guess. – Jake 'Alquimista' LEE May 16 '15 at 03:13

score 2 · Answer 1 · answered May 11 '15 at 13:19

This is a very old issue (I reported it a couple of years ago): many basic operations take extra time. It affects not just division but also addition, abs, etc. I don't know the real reason for this behavior. What is even weirder is that operations that are supposed to take more time, like addWeighted, are actually very efficient. Try this one:

addWeighted(in, 1.0/4, in, 0, 0, in);

It performs multiple operations per pixel, yet it runs a few times faster than both the add function and the loop implementation.

Here is my report on the bug tracker.


Michael Burdinov


For my setup and OpenCV version, addWeighted is equally slow, which is yet another problem. – Boyko Perfanov May 11 '15 at 13:40
This is even weirder, because I just checked it and saw that addWeighted is much faster. What version of OpenCV are you using? – Michael Burdinov May 11 '15 at 14:14
Also, can you please run the same experiment with some other function? abs(), for example. – Michael Burdinov May 11 '15 at 14:14

score 1 · Answer 2 · answered May 11 '15 at 13:01

I tried the same thing, measuring CPU time with clock().

#include <ctime>
#include <iostream>
#include <opencv2/opencv.hpp>

int main()
{
    clock_t startTime;
    clock_t endTime;

    int height =1024;
    int width =1024;

    // 700 ms
    cv::Mat in(height,width,CV_8UC1, cv::Scalar(255));
    std::cout << "value: " << (int)in.at<unsigned char>(0,0) << std::endl;

    cv::Mat out(height,width,CV_8UC1);

    startTime = clock();
    out = in/4;
    endTime = clock();
    std::cout << "1: " << (float)(endTime-startTime)/(float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)out.at<unsigned char>(0,0) << std::endl;


    startTime = clock();
    in /= 4;
    endTime = clock();
    std::cout << "2: " <<  (float)(endTime-startTime)/(float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)in.at<unsigned char>(0,0) << std::endl;

    //40 ms
    cv::Mat in2(height,width,CV_8UC1, cv::Scalar(255));

    startTime = clock();
    for (int y=0; y < in2.rows; ++y)
    {
        //unsigned char* ptr = in2.data + y*in2.step1();
        unsigned char* ptr = in2.ptr(y);
        for (int x=0; x < in2.cols; ++x)
        {
            ptr[x] /= 4;
        }
    }
    endTime = clock();
    std::cout << "3: " <<  (float)(endTime-startTime)/(float)CLOCKS_PER_SEC << std::endl;
    // Print the value only after stopping the clock, so console I/O is not timed.
    std::cout << "value: " << (int)in2.at<unsigned char>(0,0) << std::endl;


    cv::namedWindow("...");
    cv::waitKey(0);
}

with results:

value: 255
1: 0.016
value: 64
2: 0.016
value: 64
3: 0.003
value: 63

As you can see, the results differ (64 versus 63), probably because the Mat division performs floating-point division and rounds to the nearest integer, whereas your faster loop uses integer division, which truncates. The loop is faster but gives a different result.

In addition, the OpenCV computation applies a saturate_cast, but I guess the bigger difference in computational load comes from the double-precision division.

Can you add a case 4: multiply by 0.25 per element? IIRC that was also "fast" for me, suggesting that something other than plain floating-point versus integer arithmetic performance is going on. – Boyko Perfanov May 11 '15 at 13:28
On my machine, multiplication or division by a float takes about 2x the time of unsigned char division by 4 (which is a bit shift, by the way). Division by a double takes the same time as by a float, which I don't trust at the moment :) – Micka May 11 '15 at 15:06
