
According to its documentation, OpenCV converts BGR images to grayscale using the linear transformation Y = 0.299·R + 0.587·G + 0.114·B.

I tried to mimic this in NumPy by multiplying the HxWx3 BGR image matrix by the 3x1 coefficient vector [0.114, 0.587, 0.299]', which should yield an HxWx1 grayscale image matrix.

The NumPy code is as follows:

import cv2
import numpy as np
import time

# IM_PATHS is a list of paths to 12MP BGR images
im = cv2.imread(IM_PATHS[0], cv2.IMREAD_COLOR)
# Pre-allocate the destination grayscale buffer
dst = np.zeros(im.shape[:2], dtype=np.uint8)
# BGR -> grayscale projection column vector
bgr_weight_arr = np.array((0.114, 0.587, 0.299), dtype=np.float32).reshape(3, 1)
for im_path in IM_PATHS:
    im = cv2.imread(im_path, cv2.IMREAD_COLOR)
    t1 = time.time()
    # NumPy multiplication comes here
    dst[:, :] = (im @ bgr_weight_arr).reshape(*dst.shape)
    t2 = time.time()
    print(f'runtime: {(t2 - t1):.3f}sec')

Using 12MP images (4000x3000 pixels), the above NumPy-powered process typically takes around 90ms per image, and that is without rounding the multiplication results.

On the other hand, when I replace the matrix multiplication with OpenCV's function, dst[:,:] = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY), the typical runtime drops to around 5ms per image, i.e., 18x faster!

Can anyone explain how that is possible? I was always taught that NumPy uses all available acceleration techniques, such as SIMD, so how can OpenCV be so dramatically faster?
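For what it's worth, one measurable property of the matmul in the loop: mixing a uint8 image with float32 weights promotes the whole product to float32, so NumPy materializes a full-size floating-point temporary before the result is cast back into dst. A minimal sketch (random data standing in for a real image):

```python
import numpy as np

# Random uint8 data standing in for a 12MP BGR image
im = np.random.randint(0, 256, (3000, 4000, 3), dtype=np.uint8)
w = np.array((0.114, 0.587, 0.299), dtype=np.float32).reshape(3, 1)

# uint8 @ float32 promotes to float32: a ~48MB temporary for a 12MP image
prod = im @ w
```

Here prod.dtype comes out as float32 and prod occupies four bytes per pixel, on top of the source image.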


Update:

Even when using quantized (fixed-point) multiplications, NumPy's runtimes stay in the same range, around 90ms...

bgr_weight_arr_uint16 = np.round(256 * np.array((0.114, 0.587, 0.299))).astype('uint16').reshape(3, 1)
for im_path in IM_PATHS:
    im = cv2.imread(im_path, cv2.IMREAD_COLOR)
    t1 = time.time()
    # Fixed-point multiply, then divide by 256 via right shift
    dst[:, :] = np.right_shift(im @ bgr_weight_arr_uint16, 8).reshape(*dst.shape)
    t2 = time.time()
    print(f'runtime: {(t2 - t1):.3f}sec')
  • People posted very interesting comments on the deleted answer below. So is it possible that OpenCV uses something like 3 lookup tables of 256 grayscale values, one for each of the coefficients 0.114, 0.587, 0.299? That way it could just add values (even of float type) and avoid multiplying float32 numbers – SomethingSomething Feb 15 '23 at 16:23
  • You can have a look at OpenCV's `cvtColor` code here: https://github.com/opencv/opencv/blob/4.x/modules/imgproc/src/color_rgb.simd.hpp#L645 (specifically linking to the method which is called for 8-bit input images). It uses these multipliers, which are indeed 16-bit integers: https://github.com/opencv/opencv/blob/4.x/modules/imgproc/src/color.simd_helpers.hpp#L19 – chtz Feb 15 '23 at 17:00
  • @SomethingSomething: Probably not lookup tables, probably 16-bit integer approximations of the scale factors. I assume the answer was deleted by a moderator because it was written by ChatGPT, not a human, which is why it discussed matmul which this code doesn't do. Quang Hoang and Jérôme Richard pointed out that your NumPy code copies and converts to float32 (which isn't super fast and costs a lot of memory bandwidth), which also prevents NumPy from using an implementation with fast integer multiply-add instructions like `pmaddubsw` or `pmaddwd`. – Peter Cordes Feb 15 '23 at 20:59
  • I just ran an experiment with 16bit unsigned integers - the conversion took the same time... preprocessing step: multiplied the weight column vector by 256, rounded and saved the result as a uint16 Numpy array. Then, given a BGR image (uint8), I ran the multiplication the same way I did with float (should cast everything to uint16 now) and finally used "shift right" of 8 bits, to quickly divide the result by 256. The resulting images are beautiful Grayscale images, but the runtimes did not change at all... – SomethingSomething Feb 16 '23 at 11:16
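To make the fixed-point experiment from the last comment concrete, here is a sketch of the same idea written with per-channel arithmetic and broadcasting instead of matmul. This is my own reformulation for illustration, not OpenCV's actual code; the weights 29, 150, 77 are round(256 * [0.114, 0.587, 0.299]):

```python
import numpy as np

# Random uint8 data standing in for a 12MP BGR image
im = np.random.randint(0, 256, (3000, 4000, 3), dtype=np.uint8)

# Fixed-point weights: round(256 * [0.114, 0.587, 0.299]) -> they sum to 256,
# so the weighted sum fits in uint16 (max 255 * 256 = 65280 < 65535)
wb, wg, wr = 29, 150, 77

# Widen each channel to uint16, accumulate, then divide by 256 via right shift
b = im[:, :, 0].astype(np.uint16)
g = im[:, :, 1].astype(np.uint16)
r = im[:, :, 2].astype(np.uint16)
gray = ((wb * b + wg * g + wr * r) >> 8).astype(np.uint8)
```

This stays entirely in integer arithmetic, but each line still walks over full-image temporaries, so it does not automatically match a single fused SIMD loop like OpenCV's.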

0 Answers