I want to implement convolution on Arm Mali GPUs, optimised for both speed and memory. What is the best way to do this? GEMM-based MCMK convolutions are unsuitable because they use a lot of memory, and a direct GPU implementation is much slower than the corresponding CPU version. Any time spent on memory reshaping should be included in the timing measurements.
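To make the memory objection against im2col/MCMK concrete, here is a rough back-of-the-envelope sketch. The layer sizes are assumed for illustration (they are not from the question); the point is that lowering to a GEMM materialises each input element once per kernel position:

```python
import numpy as np

# Hypothetical sizes for a typical CV layer (assumed, not from the question).
C, H, W = 64, 56, 56   # input channels, height, width
R, S = 3, 3            # kernel height, width

# Direct convolution only needs the input tensor itself.
input_elems = C * H * W

# im2col builds a (C*R*S) x (H*W) patch matrix for the GEMM
# (assuming stride 1 and 'same' padding, so the output is H x W).
im2col_elems = (C * R * S) * (H * W)

expansion = im2col_elems / input_elems
print(expansion)  # R*S = 9x more memory for a 3x3 kernel
```

So the working set grows by roughly the kernel area (9x for 3x3, 25x for 5x5), which is why im2col-style lowering hurts on memory-constrained mobile GPUs, and why the reshaping time itself belongs in the timing measurements.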
- Did you try Fourier-transform-based convolution? It is many times faster than naive convolution for filter widths of 20-30 or more, and works best when the filter has the same size as the image. – huseyin tugrul buyukisik Sep 13 '19 at 14:09
- Well, my primary concern is computer vision applications, so the filter width will be at most 7, with common kernel widths of 3 or 5. – Sep 14 '19 at 14:41
- See https://github.com/ARM-software/ComputeLibrary for some pre-optimized implementations. – solidpixel Sep 27 '19 at 09:08
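On the FFT suggestion in the first comment: a minimal NumPy sketch (array sizes chosen for illustration only) showing that multiplying zero-padded FFTs reproduces a direct 2D linear convolution. For the 3x3-7x7 kernels mentioned above, the FFT padding and transform overhead usually outweighs the arithmetic savings, which matches the reply in the second comment:

```python
import numpy as np

def fft_conv2d(img, ker):
    # Full linear convolution via zero-padded real FFTs.
    H = img.shape[0] + ker.shape[0] - 1
    W = img.shape[1] + ker.shape[1] - 1
    F = np.fft.rfft2(img, s=(H, W)) * np.fft.rfft2(ker, s=(H, W))
    return np.fft.irfft2(F, s=(H, W))

def direct_conv2d(img, ker):
    # Naive O(H*W*R*S) reference: shift-and-accumulate.
    H = img.shape[0] + ker.shape[0] - 1
    W = img.shape[1] + ker.shape[1] - 1
    out = np.zeros((H, W))
    for i in range(ker.shape[0]):
        for j in range(ker.shape[1]):
            out[i:i + img.shape[0], j:j + img.shape[1]] += ker[i, j] * img
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
ker = rng.standard_normal((5, 5))
print(np.allclose(fft_conv2d(img, ker), direct_conv2d(img, ker)))  # True
```

The direct version costs O(H·W·R·S) multiplies while the FFT version costs O(H·W·log(H·W)) regardless of kernel size, so the crossover only favours FFTs once the kernel is fairly large, as the commenter notes.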