
I'm trying to write a convolution function for a GPU using OpenCL.

Benchmarking shows that the GPU's data load instructions are very expensive: the run time scales linearly with the total number of LD instructions, indicating that the GPU has little or no cache.

This makes convolutions with small- and medium-sized kernels (~48) very inefficient (roughly 1% of peak GFLOPS).

Is there a particular convolution algorithm, or an FFT algorithm, that maximizes data reuse in registers (up to 64 float4 registers are available) and is optimized for memory access?
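To make the question concrete, here is a minimal sketch of the kind of register blocking I have in mind: each work-item computes several adjacent outputs so that neighboring outputs share input loads through registers instead of re-reading global memory. The kernel name and the `KSIZE` / `OUT_PER_ITEM` constants are illustrative placeholders, not values I've settled on:

```c
// Hypothetical register-blocked 1D convolution.
// KSIZE (tap count) and OUT_PER_ITEM (outputs per work-item) are
// illustrative compile-time choices, not tuned values.
#define KSIZE 9
#define OUT_PER_ITEM 4

__kernel void conv1d_regblock(__global const float *in,
                              __global float *out,
                              __constant float *taps,
                              const int n)
{
    const int base = (int)get_global_id(0) * OUT_PER_ITEM;

    // Load the input window once; OUT_PER_ITEM + KSIZE - 1 samples
    // cover all OUT_PER_ITEM outputs computed by this work-item.
    float window[OUT_PER_ITEM + KSIZE - 1];
    for (int i = 0; i < OUT_PER_ITEM + KSIZE - 1; ++i)
        window[i] = (base + i < n) ? in[base + i] : 0.0f;

    // Each output reuses KSIZE - 1 of its neighbor's samples from
    // registers, cutting global LD traffic by roughly OUT_PER_ITEM x.
    for (int o = 0; o < OUT_PER_ITEM; ++o) {
        float acc = 0.0f;
        for (int k = 0; k < KSIZE; ++k)
            acc += window[o + k] * taps[k];
        if (base + o < n)
            out[base + o] = acc;
    }
}
```

Whether the `window` array actually stays in registers depends on the compiler fully unrolling the loops (otherwise it may spill to private memory), so this is only a sketch of the data-reuse pattern, not a guaranteed win. I'm asking whether there is a better-studied algorithm along these lines, or an FFT-based approach that fits the register budget.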

Update: Floating point is preferred.

