I'm using OpenCV for an image-processing application. I need to run a very large number of matrix operations on fairly large matrices, and I'd like to accelerate them on the GPU without coding directly in CUDA C, if possible. OpenCV 3.4.2 has a number of GPU-accelerated functions, such as cuda::multiply, but each call accelerates only a single matrix operation, so running a great many of them is still time-consuming.
My code is shown below. I call the GPU functions from a CPU-parallel (OpenMP) loop, but GPU usage stays below 5%. Is there any way to improve this? Is there a way to run these GPU function calls in parallel on the GPU itself?
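#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>

// One stream, shared by every OpenMP thread
cv::cuda::Stream stream;
const int size = 3427680;
const int iteration = 649319;
// Host matrices: 1 x size, four 32-bit float channels per element
cv::Mat cpu_mat1 = cv::Mat(1, size, CV_32FC4, cv::Scalar(1));
cv::Mat cpu_mat2 = cv::Mat(1, size, CV_32FC4, cv::Scalar(1));
cv::Mat cpu_mat3 = cv::Mat::zeros(1, size, CV_32FC4);
// Copy the inputs and the output buffer to the GPU once, before the loop
cv::cuda::GpuMat gpu_mat1;
gpu_mat1.upload(cpu_mat1);
cv::cuda::GpuMat gpu_mat2;
gpu_mat2.upload(cpu_mat2);
cv::cuda::GpuMat gpu_mat3;
gpu_mat3.upload(cpu_mat3);
#pragma omp parallel for
for (int i = 0; i < iteration; i++)
{
    // Element-wise product, queued on the shared stream
    cv::cuda::multiply(gpu_mat1, gpu_mat2, gpu_mat3, 1.0, -1, stream);
    // Per-channel sum of the product (the result is not used here)
    cv::cuda::sum(gpu_mat3);
}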
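For a sense of scale, here is a rough sketch of the data volume implied by the sizes above. It is standalone C++ with no OpenCV dependency; the helper names are hypothetical, and the only assumption is that a CV_32FC4 element is four 32-bit floats (16 bytes).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers (not part of OpenCV): estimate the data volume
// implied by the sizes used in the snippet above.

// Assumption: float is 32 bits, so one CV_32FC4 element is 16 bytes.
static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
constexpr std::uint64_t kBytesPerElement = 4 * sizeof(float);

// Bytes occupied by one 1 x `elements` CV_32FC4 matrix.
constexpr std::uint64_t bytes_per_matrix(std::uint64_t elements) {
    return elements * kBytesPerElement;
}

// Bytes touched by one element-wise multiply:
// two input matrices are read and one output matrix is written.
constexpr std::uint64_t bytes_per_multiply(std::uint64_t elements) {
    return 3 * bytes_per_matrix(elements);
}

// Compile-time check for size = 3427680 from the snippet above.
static_assert(bytes_per_matrix(3427680) == 54842880ULL,
              "one matrix is about 52.3 MiB");
```

So each matrix is about 52.3 MiB, and every multiply reads or writes roughly 165 MB in total, repeated 649319 times.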