For downsampling a signal, I use a FIR filter plus a decimation stage (which is effectively a strided convolution). The big advantage of combining filtering and decimation is the reduced computational cost: it drops by the decimation factor, since only every M-th output sample has to be computed.
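As a back-of-the-envelope check of that claim (illustrative numbers of my own, not measurements): with N input samples, T taps and decimation factor M, the full convolution needs N*T multiply-accumulates, while the decimating version only computes every M-th output:

// N input samples, T filter taps, decimation factor M:
//   full convolution : N outputs x T MACs each   -> N*T     MACs
//   FIR + decimation : only every M-th output    -> (N/M)*T MACs
// e.g. N = 1e6, T = 64, M = 4: 64e6 vs 16e6 MACs (4x fewer operations)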
With a straightforward OpenCL implementation, I am not able to benefit from the decimation. Quite the contrary: the convolution with a decimation factor of 4 is 25% slower than the full convolution.
Kernel Code:
__kernel void decimation(__constant float *input,
                         __global float *output,
                         __constant float *coefs,
                         const int taps,
                         const int decimationFactor) {
    int posOutput = get_global_id(0);
    float result = 0.0f;

    // each work-item computes one decimated output sample as the dot
    // product of the coefficients with a strided window of the input
    for (int tap = 0; tap < taps; tap++) {
        int posInput = (posOutput * decimationFactor) - tap;
        if (posInput >= 0)  // do not read before the start of the buffer
            result += input[posInput] * coefs[tap];
    }

    output[posOutput] = result;
}
I suspect this is due to uncoalesced memory access, but I cannot think of a way to fix it. Any ideas?
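To illustrate what I mean (a sketch of the address pattern, not profiler output): with a decimation factor of 4, neighbouring work-items read addresses four floats apart in every loop iteration, so on a typical GPU no warp/wavefront read falls on one contiguous cache line:

// loop iteration tap = t, decimationFactor = 4:
//   work-item 0 reads input[ 0 - t]
//   work-item 1 reads input[ 4 - t]
//   work-item 2 reads input[ 8 - t]
//   work-item 3 reads input[12 - t]
// -> 16-byte stride: 32 work-items span 512 bytes instead of the
//    128 contiguous bytes of a fully coalesced float read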
Edit: I tried Dithermaster's suggestion of splitting the problem into coalesced reads into shared local memory, followed by the convolution reading from local memory:
__kernel void decimation(__constant float *input,
                         __global float *output,
                         __constant float *coefs,
                         const int taps,
                         const int decimationFactor,
                         const int bufferSize,
                         __local float *localInput) {
    const int posOutput = get_global_id(0);
    const int localSize = get_local_size(0);
    const int localId = get_local_id(0);
    const int groupId = get_group_id(0);

    const int localInputOffset = taps - 1;
    const int localInputOverlap = taps - decimationFactor;
    const int localInputSize = localInputOffset + localSize * decimationFactor;

    // 1. transfer global input data to local memory
    // read global input to local input (only the overlap region)
    if (localId < localInputOverlap) {
        int posInputStart = ((groupId * localSize) * decimationFactor) - (taps - 1);
        int posInput = posInputStart + localId;
        int posLocalInput = localId;
        localInput[posLocalInput] = 0.0f;
        if (posInput >= 0)
            localInput[posLocalInput] = input[posInput];
    }

    // read remaining global input to local input
    // alternative 1: strided read
    // for (int i = 0; i < decimationFactor; i++) {
    //     int posInputStart = (groupId * localSize) * decimationFactor;
    //     int posInput = posInputStart + localId * decimationFactor - i;
    //     int posLocalInput = localInputOffset + localId * decimationFactor - i;
    //     localInput[posLocalInput] = 0.0f;
    //     if ((posInput >= 0) && (posInput < bufferSize * decimationFactor))
    //         localInput[posLocalInput] = input[posInput];
    // }

    // alternative 2: coalesced read (in blocks of localSize)
    for (int i = 0; i < decimationFactor; i++) {
        int posInputStart = (groupId * localSize) * decimationFactor;
        int posInput = posInputStart - (decimationFactor - 1) + i * localSize + localId;
        int posLocalInput = localInputOffset - (decimationFactor - 1) + i * localSize + localId;
        localInput[posLocalInput] = 0.0f;
        if ((posInput >= 0) && (posInput < bufferSize * decimationFactor))
            localInput[posLocalInput] = input[posInput];
    }

    // 2. wait until every work-item has finished writing local memory
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3. convolution from local memory
    if (posOutput < bufferSize) {
        float result = 0.0f;
        for (int tap = 0; tap < taps; tap++) {
            int posLocalInput = localInputOffset + (localId * decimationFactor) - tap;
            result += localInput[posLocalInput] * coefs[tap];
        }
        output[posOutput] = result;
    }
}
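For completeness, the host side has to size the __local buffer (localInputSize floats) via clSetKernelArg with a NULL pointer. A minimal sketch of my launch code (names like kernel, queue, inBuf are placeholders, not from the kernel above; error checking omitted; taps, decimationFactor and bufferSize are cl_int host variables):

// localInput must hold (taps - 1) + localSize * decimationFactor floats
size_t localSize  = 256;  // work-group size, device dependent
size_t globalSize = ((bufferSize + localSize - 1) / localSize) * localSize; // round up; kernel guards posOutput < bufferSize
size_t localBytes = (taps - 1 + localSize * decimationFactor) * sizeof(float);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &outBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &coefBuf);
clSetKernelArg(kernel, 3, sizeof(cl_int), &taps);
clSetKernelArg(kernel, 4, sizeof(cl_int), &decimationFactor);
clSetKernelArg(kernel, 5, sizeof(cl_int), &bufferSize);
clSetKernelArg(kernel, 6, localBytes, NULL);  // __local float *localInput
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);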
Big improvement! But the performance still does not scale with the overall number of operations, i.e. it is not proportional to the decimation factor (ideal scaling is spelled out for comparison after the list):
- speedup of the full convolution compared to the first approach: ~12 %
- computation time of the decimation relative to the full convolution:
  - decimation factor 2: 61 %
  - decimation factor 4: 46 %
  - decimation factor 8: 53 %
  - decimation factor 16: 68 %
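For comparison, if the runtime scaled purely with the operation count, the expected figures would be 1/M of the full convolution:

// ideal scaling t(M) = t(1) / M:
//   M =  2 -> 50 %      (measured: 61 %)
//   M =  4 -> 25 %      (measured: 46 %)
//   M =  8 -> 12.5 %    (measured: 53 %)
//   M = 16 ->  6.25 %   (measured: 68 %)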
The performance has an optimum at a decimation factor of 4. Why is that? Any ideas for further improvements?
Edit 2: Diagram of the shared local memory layout:

Edit 3: Comparison of the performance of the three implementations: