
I'm teaching myself OpenCL by trying to optimize the mpeg4dst reference audio encoder. I achieved a 3x speedup by using vector instructions on the CPU, but I figured the GPU could probably do better.

I'm focusing on computing auto-correlation vectors in OpenCL as my first area of improvement. The CPU code is:

for (int i = 0; i < NrOfChannels; i++) {
    for (int shift = 0; shift <= PredOrder[ChannelFilter[i]]; shift++)
        vDSP_dotpr(Signal[i] + shift, 1, Signal[i], 1, &out, NrOfChannelBits - shift);
}
NrOfChannels = 6
PredOrder = 129
NrOfChannelBits = 150528

On my test file, this function takes approximately 188ms to complete.

Here's my OpenCL method:

kernel void calculateAutocorrelation(size_t offset,
                                     global const float *input,
                                     global float *output,
                                     size_t size) {
    size_t index = get_global_id(0);
    size_t end = size - index;
    float sum = 0.0f;

    for (size_t i = 0; i < end; i++)
        sum += input[i + offset] * input[i + offset + index];

    output[index] = sum;
}

This is how it is called:

gcl_memcpy(gpu_signal_in, Signal, sizeof(float) * NrOfChannels * MAXCHBITS);

for (int i = 0; i < NrOfChannels; i++) {
    size_t sz = PredOrder[ChannelFilter[i]] + 1;
    cl_ndrange range = { 1, { 0, 0, 0 }, { sz, 0, 0 }, { 0, 0, 0 } };

    calculateAutocorrelation_kernel(&range, i * MAXCHBITS, (cl_float *)gpu_signal_in, (cl_float *)gpu_out, NrOfChannelBits);
    gcl_memcpy(out, gpu_out, sizeof(float) * sz);
}

According to Instruments, my OpenCL implementation seems to take about 13ms, with about 54ms of memory copy overhead (gcl_memcpy).

When I use a much larger test file (1 minute of 2-channel music vs. 1 second of 6-channel), the measured performance of the OpenCL code stays about the same, but CPU usage falls to about 50% and the whole program takes about 2x longer to run.

I can't find a cause for this in Instruments and I haven't read anything yet that suggests that I should expect very heavy overhead switching in and out of OpenCL.

Tim
  • What are your system specs? GPU? OS? In general, kernel launch overhead is quite large on a discrete GPU, and the worst launch overhead you can get is a discrete GPU on Windows Vista or newer (due to the driver model). In addition, it seems your OpenCL kernel runs relatively few threads (128 threads is really low for a GPU). Ideally your problem should have distinct threads (ndrange) on the order of thousands. On top of that, memory copying is quite expensive over the PCIe bus. Try to write it so that instead of a memcpy per loop iteration you copy the whole result out after all the kernel calls. – sharpneli Nov 24 '13 at 12:00
  • I'm on OS X 10.9 on a MacBook Pro with an NVIDIA GT 650M graphics card. I've removed the second gcl_memcpy and also the loop, and I'm learning about cl_ndrange to try and get more threads on the GPU. However, from my benchmarks I'm seeing 10ms to execute the kernel and 50ms in gcl_memcpy (which should be the PCIe transfer in both directions). This should give me a good speedup, ~60ms vs ~190ms. Instead I'm seeing time "disappear", with the CPU being used at only 40% and the whole process taking almost twice the time it did before. – Tim Nov 24 '13 at 12:09

1 Answer


If I'm reading your kernel code correctly, each work item iterates over all of the data from its location to the end. This isn't going to be efficient. First (and the primary performance concern), the memory accesses won't be coalesced, so they won't run at full memory bandwidth. Second, because each work item has a different amount of work, there will be branch divergence within a work group, which will leave some threads idle waiting for others.

This seems like it has a lot in common with a reduction problem, and I'd suggest reading up on "parallel reduction" to get some hints about doing an operation like this in parallel.

To see how memory is being read, work out how 16 work items (say, global_id 0 to 15) will be reading data for each step.

Note that if every work item in a work group accesses the same memory, there is a "broadcast" optimization the hardware can make. So just reversing the order of your loop could improve things.

Dithermaster