
I'm looking for a way to compute 2D convolutions/correlations fast on large images, preferably in Python. My filters can be made separable, though at some added computational cost. The fastest way to compute a 2D convolution that I have found so far is with OpenCV. However, its separable-filter function, `sepFilter2D`, is slower than the non-separable `filter2D`. Here are the timings I get on a 4-core laptop:

```python
import cv2
import numpy as np

M = 2048
N = 8192
K = 25
A = np.random.randn(M, N).astype('float32')
b = np.random.randn(1, K).astype('float32')
c = np.random.randn(K, 1).astype('float32')
bc = c * b  # outer product: the full (K, K) kernel
X = cv2.filter2D(A, -1, bc, borderType=cv2.BORDER_CONSTANT)
Y = cv2.sepFilter2D(A, -1, b, c, borderType=cv2.BORDER_CONSTANT)
Z = cv2.filter2D(cv2.filter2D(A, -1, b, borderType=cv2.BORDER_CONSTANT), -1, c, borderType=cv2.BORDER_CONSTANT)

# check that all three give the same result
assert np.linalg.norm(X - Y) / np.linalg.norm(X) < 1e-6
assert np.linalg.norm(X - Z) / np.linalg.norm(X) < 1e-6

%timeit cv2.filter2D(A, -1, bc, borderType=cv2.BORDER_CONSTANT)
%timeit cv2.sepFilter2D(A, -1, b, c, borderType=cv2.BORDER_CONSTANT)
%timeit cv2.filter2D(cv2.filter2D(A, -1, b, borderType=cv2.BORDER_CONSTANT), -1, c, borderType=cv2.BORDER_CONSTANT)
```

```none
309 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
445 ms ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
123 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

In the last version I run the two 1D filters sequentially, and this turns out to be faster than either of the other two methods. Why is this? Even so, I was hoping for a larger benefit from the separable filter. Is there a faster way to compute a 2D convolution with a separable filter?
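For anyone timing alternatives: the same two-pass separable filtering can also be written with SciPy's `ndimage`, which does the 1D passes without materializing the intermediate via a second library call chain. A minimal sketch (array sizes reduced from the ones above so the equivalence check runs quickly; `mode='constant'` plays the role of `cv2.BORDER_CONSTANT`):

```python
import numpy as np
from scipy import ndimage

M, N, K = 256, 512, 25  # smaller than the sizes above, for a quick check
A = np.random.randn(M, N).astype('float32')
b = np.random.randn(K).astype('float32')  # row kernel
c = np.random.randn(K).astype('float32')  # column kernel

# Two sequential 1D correlations with zero padding ('constant' mode),
# equivalent to correlating with the outer-product kernel np.outer(c, b).
Y_sep = ndimage.correlate1d(A, b, axis=1, mode='constant')
Y_sep = ndimage.correlate1d(Y_sep, c, axis=0, mode='constant')

# Full 2D correlation as a reference.
Y_full = ndimage.correlate(A, np.outer(c, b), mode='constant')
assert np.linalg.norm(Y_sep - Y_full) / np.linalg.norm(Y_full) < 1e-4
```

Whether this beats the chained `cv2.filter2D` calls will depend on the build (OpenCV is multithreaded; `ndimage` is single-threaded), so it is worth benchmarking on the actual sizes.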

Cris Luengo
Amit Hochman
  • FYI the results may vary depending on your specific setup. I tested a similar code on an M1 Macbook and got: `single: 210ms/loop`, `sepFilter: 99ms/loop`, `double: 68ms/loop` ran with 100 loops each. – mimocha Feb 15 '23 at 15:50
  • The reason that [cv2.filter2D](https://docs.opencv.org/4.x/d4/d86/group__imgproc__filter.html#ga27c049795ce870216ddfb366086b5a04) is so (relatively) fast is that the function uses "DFT-based algorithm in case of sufficiently large kernels (~11x11 or larger)". The reason that `cv2.sepFilter2D` is so slow is probably due to poor optimizations. It would be interesting comparing the execution time to [ippiFilterSeparable](https://www.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-2-image-processing/filtering-functions-2/separable-filters/filterseparable.html). – Rotem Feb 15 '23 at 22:22
  • Thanks for the reference to ippiFilterSeparable - looks interesting. About the use of the FFT: when I try to compute `ifft2(fft2(A))` it takes 346 ms using FFTW, and this is without transforming the kernel and doing the multiplication. So perhaps OpenCV has a faster FFT? – Amit Hochman Feb 16 '23 at 08:53
  • OpenCV’s FFT is not as fast as NumPy’s (PocketFFT), in my experience, unless they changed the implementation recently. I’m guessing this is just one more case of disappointing choices in OpenCV. – Cris Luengo Feb 16 '23 at 14:35
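As a point of comparison for the DFT-based path discussed above, an FFT-based filtering round trip can be sketched with SciPy's `fftconvolve` (this is an illustrative sketch, not from the original post; sizes are reduced so the spatial-domain reference check runs quickly):

```python
import numpy as np
from scipy.signal import fftconvolve, convolve2d

M, N, K = 256, 512, 25  # smaller than the question's sizes, for a quick check
A = np.random.randn(M, N).astype('float32')
bc = np.random.randn(K, K).astype('float32')

# fftconvolve computes a convolution; flipping the kernel in both axes
# turns it into the correlation that cv2.filter2D computes. mode='same'
# with implicit zero padding corresponds to cv2.BORDER_CONSTANT.
X_fft = fftconvolve(A, bc[::-1, ::-1], mode='same')

# Direct (spatial-domain) reference for the same operation.
X_dir = convolve2d(A, bc[::-1, ::-1], mode='same', boundary='fill')
assert np.linalg.norm(X_fft - X_dir) / np.linalg.norm(X_dir) < 1e-4
```

Timing `fftconvolve` on the full 2048×8192 image would show whether SciPy's PocketFFT backend closes the gap to OpenCV's internal DFT path.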

0 Answers