For my research I have a lot of different images A, which I want to convolve with a kernel B as fast as possible. The images are (M x N) and the kernel is (M x P). In the plain convolution (which I have implemented right now) I slide the kernel over the image in the 'x' direction, which produces a 1-dimensional result of size (1 x (N + P - 1)) and gives the correct answer. N and P are quite large (order 5000), and I want to speed this process up because I have to repeat it for a lot of different images.
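For reference, here is a minimal sketch of the direct version I described (assuming NumPy; the function name `direct_row_conv` is just for illustration): each row of the image is convolved with the matching row of the kernel along 'x', and the rows are summed.

```python
import numpy as np

def direct_row_conv(A, B):
    """Convolve each row of A (M x N) with the matching row of B (M x P)
    along 'x', then sum over the rows -> 1-D result of length N + P - 1."""
    M, N = A.shape
    _, P = B.shape
    out = np.zeros(N + P - 1)
    for m in range(M):
        out += np.convolve(A[m], B[m], mode="full")
    return out
```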
One approach I was thinking about is taking the FFT along the 'x' direction of both the kernel and the image (after zero-padding), multiplying them element-wise, taking the IFFT, and summing over the rows. This should work, but I was wondering whether I could also do a 2-D convolution. Could I take the 2-D FFT and then multiply only the middle row? That might work, but how would you do the padding in that case, and which rows exactly would you multiply (the zero-frequency ones?).
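The row-wise FFT approach can be sketched as follows (assuming NumPy; `fft_row_conv` is a name I made up). One small optimization: by linearity of the inverse transform, summing the spectra products over the rows first and then doing a single IFFT gives the same result as IFFT-then-sum, saving M - 1 inverse transforms.

```python
import numpy as np

def fft_row_conv(A, B):
    """Row-wise FFT convolution: zero-pad both inputs to length N + P - 1
    along 'x' (the n= argument of rfft does the padding), multiply the
    spectra element-wise, sum over rows, then take one inverse FFT."""
    M, N = A.shape
    _, P = B.shape
    L = N + P - 1
    FA = np.fft.rfft(A, n=L, axis=1)
    FB = np.fft.rfft(B, n=L, axis=1)
    # Sum the per-row spectra products, then a single irfft of length L.
    return np.fft.irfft((FA * FB).sum(axis=0), n=L)
```

This matches the direct per-row `np.convolve(..., mode="full")` summed over rows, and the same structure (batched 1-D FFTs, pointwise multiply, one inverse) maps naturally onto cuFFT later.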
I'm currently working on some test cases but have not found a proper answer yet; I will update this post when I know more.
Would love to know what you guys think.
PS: I'm building the test case in Python, but in the end I want to implement it in CUDA C++ to really make it fast.