
I'm trying to write a convolution function for a GPU using OpenCL.

Benchmarking shows that the GPU's data load instructions are very expensive: the run time scales linearly with the total number of LD instructions, indicating that the GPU has little or no cache.

This makes convolutions with small- and medium-sized kernels (~48) very inefficient (roughly 1% of peak GFLOPS).

Is there a particular convolution algorithm, or an FFT algorithm, that maximizes data reuse in registers (up to 64 float4 registers are available) and is optimized for memory access?
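To make the question concrete, here is a minimal sketch of the kind of register blocking I have in mind: each work-item computes several adjacent outputs so that neighboring outputs share input loads through registers instead of re-reading global memory. The kernel name and the `KSIZE` / `OUT_PER_ITEM` constants are illustrative placeholders, not values I've settled on:

```c
// Hypothetical register-blocked 1D convolution.
// KSIZE (tap count) and OUT_PER_ITEM (outputs per work-item) are
// illustrative compile-time choices, not tuned values.
#define KSIZE 9
#define OUT_PER_ITEM 4

__kernel void conv1d_regblock(__global const float *in,
                              __global float *out,
                              __constant float *taps,
                              const int n)
{
    const int base = (int)get_global_id(0) * OUT_PER_ITEM;

    // Load the input window once; OUT_PER_ITEM + KSIZE - 1 samples
    // cover all OUT_PER_ITEM outputs computed by this work-item.
    float window[OUT_PER_ITEM + KSIZE - 1];
    for (int i = 0; i < OUT_PER_ITEM + KSIZE - 1; ++i)
        window[i] = (base + i < n) ? in[base + i] : 0.0f;

    // Each output reuses KSIZE - 1 of its neighbor's samples from
    // registers, cutting global LD traffic by roughly OUT_PER_ITEM x.
    for (int o = 0; o < OUT_PER_ITEM; ++o) {
        float acc = 0.0f;
        for (int k = 0; k < KSIZE; ++k)
            acc += window[o + k] * taps[k];
        if (base + o < n)
            out[base + o] = acc;
    }
}
```

Whether the `window` array actually stays in registers depends on the compiler fully unrolling the loops (otherwise it may spill to private memory), so this is only a sketch of the data-reuse pattern, not a guaranteed win. I'm asking whether there is a better-studied algorithm along these lines, or an FFT-based approach that fits the register budget.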

Update: Floating point is preferred.

