I am using cv::matchTemplate to track a moving object in a video.

However, OpenCV template matching on a small image can be slower on a newer, higher-end Intel CPU. The code snippet below typically runs about 2 times slower on an i9-7920x (0.28 ms/match) than on an i7-9700k (0.14 ms/match).

#include <chrono>
#include <fstream>
#include <iostream>
#include <thread>

#include <opencv2/opencv.hpp>

#pragma optimize("", off)

int main()
{
    cv::Mat haystack;
    cv::Mat needle;
    cv::Mat result;

    cv::Rect rect;
    //https://en.wikipedia.org/wiki/Barack_Obama#/media/File:President_Barack_Obama.jpg
    haystack = cv::imread("C:/President_Barack_Obama.jpg");
    rect.width = 64;
    rect.height = 64;
    haystack = haystack(rect);
    rect.width = 12;
    rect.height = 12;
    rect.x = 50;
    rect.y = 50;
    needle = haystack(rect);

    auto start = std::chrono::high_resolution_clock::now();

    int nbmatch = 10000;
    for (int i = 0; i < nbmatch; i++) {
        cv::matchTemplate(haystack, needle, result, cv::TemplateMatchModes::TM_CCOEFF_NORMED);
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff = end - start;
    std::cout << "time per match: " << (diff.count() / nbmatch) * 1000 << " ms\n";
    std::this_thread::sleep_for(std::chrono::seconds(500));
}
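
A single average over 10000 calls can hide warm-up and frequency-ramp effects, so per-call statistics may be more telling. The following is a minimal sketch, not the code that produced the numbers above: `TimingStats` and `time_calls` are hypothetical names introduced for illustration, and the helper uses only the standard library (with `std::chrono::steady_clock` rather than `high_resolution_clock`) so that any callable can be timed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of OpenCV): per-call timings in milliseconds.
struct TimingStats {
    double min_ms;
    double median_ms;  // upper middle element when the sample count is even
    double mean_ms;
};

// Calls `work` `warmup` times without timing, then times `runs` calls one by one.
template <typename Work>
TimingStats time_calls(Work&& work, std::size_t warmup, std::size_t runs)
{
    using clock = std::chrono::steady_clock;

    // Untimed warm-up so caches and clock frequency can settle.
    for (std::size_t i = 0; i < warmup; ++i) {
        work();
    }

    std::vector<double> samples;
    samples.reserve(runs);
    double total = 0.0;
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
        samples.push_back(elapsed.count());
        total += elapsed.count();
    }

    if (samples.empty()) {
        return {0.0, 0.0, 0.0};
    }
    std::sort(samples.begin(), samples.end());
    return {samples.front(), samples[samples.size() / 2], total / samples.size()};
}
```

For the snippet above, the callable would be a lambda wrapping the match, for example `time_calls([&] { cv::matchTemplate(haystack, needle, result, cv::TM_CCOEFF_NORMED); }, 100, 10000)`. If the minimum is similar on both machines but the median differs, that would point toward scheduling or frequency effects rather than raw compute speed, although this is only a hint and not a proof.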

In my real application, I noticed this:

  • i7-9700k: 1ms;
  • i7-6800k: 1.3ms;
  • i9-7920x: 2.8ms;
  • i9-9820x: 2.8ms.

Both i9s are slower by a margin that the slight difference in clock speed cannot explain.

Windows 7 or 10 makes no difference. The code is compiled with Visual Studio 2019 (v142). I use the pre-built OpenCV libraries (building OpenCV from source myself did not help).


Edit: Frequency scaling seems to have a significant impact. Running single-threaded, the i9-7920x still takes 2.8 ms if I sleep regularly, but if I yield instead (CPU load of 100%) the time drops to 1.9 ms.
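
The sleep-versus-yield comparison could be reproduced with something like the sketch below. It is an assumption about how such a test might be structured, not the code from the real application: `IdleMode` and `mean_call_ms` are hypothetical names, `work` stands in for the cv::matchTemplate call, and `idle` is an arbitrary pause between calls that mimics waiting for the next frame.

```cpp
#include <chrono>
#include <thread>

// Hypothetical switch: how the thread waits between two calls of the workload.
enum class IdleMode { Sleep, Yield };

// Returns the mean time spent inside `work` (in ms), excluding the idle periods.
template <typename Work>
double mean_call_ms(Work&& work, int iterations, IdleMode mode,
                    std::chrono::microseconds idle)
{
    using clock = std::chrono::steady_clock;

    if (iterations <= 0) {
        return 0.0;
    }

    clock::duration busy{0};
    for (int i = 0; i < iterations; ++i) {
        const auto start = clock::now();
        work();
        busy += clock::now() - start;

        if (mode == IdleMode::Sleep) {
            // The core may drop to a lower frequency or a deeper idle state here.
            std::this_thread::sleep_for(idle);
        } else {
            // Keeps the core busy (about 100% load) until the pause has elapsed.
            const auto deadline = clock::now() + idle;
            while (clock::now() < deadline) {
                std::this_thread::yield();
            }
        }
    }
    return std::chrono::duration<double, std::milli>(busy).count() / iterations;
}
```

Calling it twice with the same workload, once with `IdleMode::Sleep` and once with `IdleMode::Yield`, gives two means that can be compared per machine. A large gap on one CPU only would be consistent with the frequency-scaling explanation, without ruling out other causes.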


Question:

What could explain this?

Do you think it is possible to bring all processors into the same range of computation time using cv::matchTemplate?

What else could I do to reduce my computation time?

pepece
  • The i7-9700k has an iGPU. Can you check in Task Manager whether cv::matchTemplate is running on the iGPU? – Nuzhny Feb 07 '20 at 06:30
  • @Nuzhny I would be surprised if this were the explanation, because my iGPU is disabled and does not appear in the Device Manager. Moreover, OpenCV has a GPU version of matchTemplate, but it is not the one I am using now. – pepece Feb 07 '20 at 10:10
  • How many CPU cores does the program use on each CPU? Is the CPU usage the same on both CPUs? – MeiH Feb 07 '20 at 12:27
  • @MH304 It is a multithreaded program without specific core affinity. It spreads across all the cores, and the CPU load is shown (in Task Manager, for instance) as around 20%. CPU usage is fairly low on all machines (relative to CPU capacity). The "weaker" CPUs, which usually run faster, are the first to get overwhelmed if I fake a heavy load (more tracking). – pepece Feb 07 '20 at 12:39
  • I want to see the Performance tab from Task Manager in both cases – Nuzhny Feb 07 '20 at 12:46
  • Task Manager is not a perfect tool for checking CPU usage. It's better to check the CPU performance with a simple single-threaded program; otherwise the program is to blame. It may also be related to the SSE2 or SSE4 implementation in OpenCV, which I'm not an expert in. – MeiH Feb 07 '20 at 13:44
  • @Nuzhny [i9-9720x](https://ibb.co/NN7D0ZF) [i7-9700k](https://ibb.co/2WMpKB0). The load might be a bit higher on the 9700k because hyperthreading is disabled and there is other stuff running. – pepece Feb 07 '20 at 14:15
  • @MH304 Good point. I also noticed that frequency scaling is important; see my edit. – pepece Feb 07 '20 at 14:21
  • Memory latency is very different between those CPUs: DRAM and L3 latency on i7-9700k (quad-core "client" with a small ring bus between cores/L3 slices and memory controllers) is much better than DRAM and L3 latency on a Skylake-X (many-core "server" with SKX's mesh network). I'm guessing your workload touches more than 1MiB of memory (L2 cache size on SKX)? That, combined with the sleep, may be having a bigger effect on one machine than the other. [Why does this delay-loop start to run faster after several iterations with no sleep?](//stackoverflow.com/q/38299023) – Peter Cordes Feb 07 '20 at 14:32
  • If Windows bounces your task around between cores, also keep in mind that SKX can scale CPU frequency for each core separately. But "client" chips have all cores and L3 locked to the same frequency, so migrating to another core doesn't mean having to wait for that core to decide to turbo up to max frequency. If you were using Linux, I'd suggest using `perf stat` to see the average CPU frequency of cores running your program (using perf counters for hardware cycles and the OS's task-clock). Possibly check your energy_performance_preference frequency scaling settings. – Peter Cordes Feb 07 '20 at 14:35
  • Also, i9-7920X has AVX512; if software uses that some, it can limit max turbo. Or possibly if it uses AVX-512 inefficiently, it could make things worse than AVX2. – Peter Cordes Feb 07 '20 at 14:37
  • re: memory latency: that also means single-thread memory *bandwidth* is much worse on SKX CPUs than on "client" chips: [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](//stackoverflow.com/q/39260020). And quad-core client Skylake vs. dual-*socket* Skylake Xeon (48 cores): [memory bandwidth for many channels x86 systems](//stackoverflow.com/q/56803987). Dual socket introduces the problem of snooping between sockets, but your SKX HEDT might be more similar to the Xeon than the client chip. – Peter Cordes Feb 07 '20 at 14:42
  • Thanks a lot @PeterCordes for all your relevant comments. I am investigating. I also wish I were on Linux for profiling =) – pepece Feb 07 '20 at 15:26
  • You *could* boot Linux from a USB stick for a profiling run, if you have a few specific experiments in mind you want to try with `perf`. Or install VTune under Windows. – Peter Cordes Feb 08 '20 at 06:21
