
I have compiled the newest available OpenCV 4.5.4 version for use with the newest CUDA 11.5 with fast math enabled running on a Windows 10 machine with a GeForce RTX 2070 Super graphics card (7.5 arch). I'm using Python 3.8.5.

Runtime results:

  • CPU outperforms GPU (matching a 70x70 needle image in a 300x300 source image)
  • biggest GPU bottleneck is the need to upload the files to the GPU before template matching
  • CPU takes around 0.005 seconds while the GPU takes around 0.42 seconds
  • Both methods end up finding a 100% match

Images used:

Source image

Needle image

Python code using CPU:

import cv2
import time

start_time = time.time()
src = cv2.imread("cat.png", cv2.IMREAD_GRAYSCALE)
needle = cv2.imread("needle.png", 0)

result = cv2.matchTemplate(src, needle, cv2.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
print("CPU --- %s seconds ---" % (time.time() - start_time))

Python code using GPU:

import cv2
import time

start_time = time.time()
src = cv2.imread("cat.png", cv2.IMREAD_GRAYSCALE)
needle = cv2.imread("needle.png", 0)

gsrc = cv2.cuda_GpuMat()
gtmpl = cv2.cuda_GpuMat()

upload_time = time.time()
gsrc.upload(src)
gtmpl.upload(needle)
print("GPU Upload time --- %s seconds ---" % (time.time() - upload_time))

match_time = time.time()
matcher = cv2.cuda.createTemplateMatching(cv2.CV_8UC1, cv2.TM_CCOEFF_NORMED)
gresult = matcher.match(gsrc, gtmpl)
print("GPU Match time --- %s seconds ---" % (time.time() - match_time))

result_time = time.time()
resultg = gresult.download()
min_valg, max_valg, min_locg, max_locg = cv2.minMaxLoc(resultg)
print("GPU Result time --- %s seconds ---" % (time.time() - result_time))
print("GPU --- %s seconds ---" % (time.time() - start_time))

Even if I don't take the time needed to upload the images to the GPU into consideration, the matching alone takes more than 10x as long as the entire process on the CPU. My CUDA installation is correct; I have run other tests where the GPU outperformed the CPU by a lot, but the results for template matching are really disappointing so far.
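The timing methodology matters here: creating the matcher and the first call to it can include one-time setup costs. Below is a minimal, generic timing helper (a sketch, not part of the original post) that warms up first and then reports the median over many runs; it assumes you call it with the objects from the GPU code above already created:

```python
import time
import statistics

def benchmark(fn, warmup=3, runs=100):
    """Return the median wall-clock seconds per call of fn().

    The warm-up calls absorb one-time costs (e.g. lazy CUDA context
    creation or just-in-time compilation) so that they do not skew
    the measured samples.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

With this helper, the matching line alone could be timed as `benchmark(lambda: matcher.match(gsrc, gtmpl))`, keeping `createTemplateMatching` and the uploads outside the measured function.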

Why is the GPU performing so badly?

  • It is possible that the CPU and GPU use different template matching approaches. I know that the CPU one uses DFT and other optimizations and is very optimized for speed on the CPU. I do not know what approach the GPU uses. – fmw42 Dec 16 '21 at 05:12
  • GPUs have latency. that's a well known fact in GPU programming. – Christoph Rackwitz Dec 16 '21 at 07:40
  • Exclude initialization of the matcher from the time measurement and, if possible, run the matching once without measurement and then again with measurement, to make sure no just-in-time compilation is still active (which can happen on the first call). If possible, measure the total time of many runs. – Micka Dec 16 '21 at 08:05
  • Thanks @Micka, I already did that. I looped only the matching line 100 times without further initializations and took the average; the CPU still ended up being 10x faster. – r00flr00fl Dec 16 '21 at 10:50
  • the code you present conflates `cv2.cuda.createTemplateMatching` and `matcher.match`. move one _out_ of the timing. -- it's possible that OpenCV code isn't perfect. it might still do copies and whatnot even in the single "matching" call. – Christoph Rackwitz Dec 16 '21 at 11:29
  • Your images are less than 300KB. With PCIe 3.0 x16 you have a transfer duration of about 20 microseconds. This explains 0.005% of the additional time. Only 99.995% to go. – Sebastian Dec 18 '21 at 13:35
  • You can run Compute Nsight to see, what routines your GPU runs and how long they take. – Sebastian Dec 18 '21 at 13:36

1 Answer


In answer to your question:

  1. You said that other tasks were better suited to the GPU. I read the Python CUDA documentation, and it suggests that you are correct: some tasks are better suited to the CPU and some to the GPU. Without getting into registers and other details I would have to learn to tell you, I can say that what you write makes sense in reference to the documentation.
  2. I don't see the actual times here. Also, this bottleneck seems to be expected: the CPU is soldered to the motherboard with a more direct connection to memory, while the GPU is a card attached through an expansion slot that has limitations a motherboard connection doesn't. It is also not a troublesome bottleneck, because it is not congested.
     2.1 Someone else wrote that my hypothesis about the computer's design is "nonsense." I beg to differ. This tiny dataset could make the difference in speed between the card and the board noticeable: the data originates on the card and then travels to the host's memory.
  3. What I have read about architecture and the CUDA documentation suggests that your results are not abnormal. The CUDA modules might be better used with large datasets. The advantage provided is that the GPU and CPU can work simultaneously, not in competition.
  • 2
    sorry but that's just **nonsense**. (1) is an empty phrase (2) has nothing at all to do with the physical connection or "soldering" (3) no, the issue here is lack of **pipelining**. usage of GPUs is known to involve latency for various setup tasks. large data sets just hide the programming flaws better. one can run small data sets effectively too, just don't expect setup costs to disappear. – Christoph Rackwitz Dec 16 '21 at 07:43
  • I disagree, but am interested to learn that you believe the bottleneck to be pipelining, and that it is a bottleneck. I don't see a programming flaw here: this tiny dataset isn't what the Python CUDA modules are designed to handle. It isn't hiding a flaw to work better with what it is designed to do. – zeroGwannabee Dec 16 '21 at 18:15
  • After reviewing this post, I politely agree with Mr. Rackwitz that half a second could not be explained by hardware, but don't retract anything I wrote as "nonsense." – zeroGwannabee Dec 16 '21 at 19:02
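The latency argument in the comments above can be made concrete with a toy cost model (every number below is an illustrative assumption, not a measurement): a GPU call pays a fixed launch/transfer overhead plus a small per-pixel cost, while the CPU pays no fixed overhead but a larger per-pixel cost. The fixed overhead then dominates for small images and amortizes away for large ones:

```python
# Toy cost model -- all constants are made-up illustrative values.
GPU_OVERHEAD_S = 5e-3    # assumed fixed per-call setup/launch cost
GPU_PER_PIXEL_S = 1e-9   # assumed per-pixel cost on the GPU
CPU_PER_PIXEL_S = 2e-8   # assumed per-pixel cost on the CPU

def gpu_time(pixels):
    """Modeled GPU time: fixed overhead plus per-pixel work."""
    return GPU_OVERHEAD_S + pixels * GPU_PER_PIXEL_S

def cpu_time(pixels):
    """Modeled CPU time: per-pixel work only."""
    return pixels * CPU_PER_PIXEL_S

small = 300 * 300      # the question's source image size
large = 8000 * 8000    # a much larger hypothetical image

# Under these assumptions the CPU wins at 300x300 (overhead dominates),
# while the GPU wins once the image is large enough to amortize it.
```

Under this model the crossover point is where `pixels * (CPU_PER_PIXEL_S - GPU_PER_PIXEL_S)` exceeds the fixed overhead; at 90,000 pixels it does not, which matches the pattern the question reports.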