
If I have a kernel which looks back over the last X minutes and calculates the average of all the values in a float[], would I experience a performance drop if all the threads are not executing the same line of code at the same time?

e.g. say at x = 1500, there are 500 data points spanning the last 2-hour period.

At x = 1510, there are 300 data points spanning the last 2-hour period.

The thread at x = 1500 will have to look back 500 places, yet the thread at x = 1510 only looks back 300, so the latter thread will move on to the next position before the first thread is finished.

Is this typically an issue?

EDIT: Example code. Sorry, but it's in C# as I was planning to use CUDAfy.net. Hopefully it provides a rough idea of the type of programming structures I need to run (the actual code is more complicated, but has a similar structure). Any comments regarding whether this is suitable for a GPU/coprocessor or just a CPU would be appreciated.

[Cudafy]
public static void PopulateMeanArray(GThread thread, float[] data)
{
    float lookFwdDistance = 108000000000f;    // How far forward in time to process from each block's starting position.
    float lookBkDistance = 12000000000f;      // Size of the look-back window, in the same time units as SDS.tick.
    int counter = thread.blockIdx.x * 1000;   // Ensures a unique region of data is written to (assuming I have fewer than 1000 entries per block).
    float numberOfTicksInLookBack = 0;
    float sum = 0;    // Stores the sum of the differences between consecutive time ticks during the x-min look back.

    // Note: the time difference between consecutive ticks is not constant, therefore numberOfTicksInLookBack differs at each position.
    // Thread 1 could be working here.
    int startPosition = thread.blockIdx.x;
    for (int tickPosition = startPosition; SDS.tick[tickPosition] < SDS.tick[startPosition] + lookFwdDistance; tickPosition++)
    {
        sum = 0;
        numberOfTicksInLookBack = 0;

        // Thread 2 could be working here. Is this warp divergence?
        for (int pastPosition = tickPosition - 1; SDS.tick[pastPosition] > SDS.tick[tickPosition] - lookBkDistance; pastPosition--)
        {
            sum += SDS.tick[pastPosition + 1] - SDS.tick[pastPosition];    // Positive gap between two consecutive ticks.
            numberOfTicksInLookBack++;
        }
        data[counter] = sum / numberOfTicksInLookBack;
        counter++;
    }
}
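
(For context, a kernel like this would be compiled and launched from the host roughly as below. This is only a sketch of my assumed CUDAfy.net usage; numberOfBlocks and the array sizes are placeholders, and SDS.tick would also need to be copied to the device.)

// Host-side sketch (assumed CUDAfy.net usage; sizes are placeholders).
CudafyModule km = CudafyTranslator.Cudafy();
GPGPU gpu = CudafyHost.GetDevice(eGPUType.Cuda);
gpu.LoadModule(km);

int numberOfBlocks = 100;                                  // Placeholder grid size.
float[] data = new float[numberOfBlocks * 1000];
float[] devData = gpu.Allocate<float>(data);
gpu.Launch(numberOfBlocks, 1).PopulateMeanArray(devData);  // One thread per block, as in the kernel above.
gpu.CopyFromDevice(devData, data);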
Hans Rudel
  • Can you provide some example source code? What is `x`? Generally speaking, thread divergence is a bad thing. – RoBiK Jun 11 '13 at 10:28
  • @RoBiK Please see the edit to the original question. Hopefully that will give you a rough idea (I haven't tried running it) of what I was trying to explain. Yeah, I wasn't sure if it counted as thread divergence or not. Thanks for taking the time to reply. – Hans Rudel Jun 11 '13 at 11:36
  • As far as I can see there is no dependence on a `threadIdx`, so basically all threads in a block will do exactly the same calculation. Are you sure the code is correct? – RoBiK Jun 11 '13 at 13:01
  • @RoBiK Yeah, there is no dependence on threadIdx; I was just wondering if there would be an issue if one or more of the threads took longer to execute the inner for loop, leading to other threads then executing the outer for loop at the same time. – Hans Rudel Jun 11 '13 at 13:23

1 Answer


CUDA runs threads in groups called warps. On all CUDA architectures that have been implemented so far (up to compute capability 3.5), the size of a warp is 32 threads. Only threads in different warps can truly be at different locations in the code. Within a warp, threads are always in the same location. Any threads that should not be executing the code in a given location are disabled as that code is executed. The disabled threads are then just taking up room in the warp and cause their corresponding processing cycles to be lost.

In your algorithm, you get warp divergence because the exit condition in the inner loop is not satisfied at the same time for all the threads in the warp. The GPU must keep executing the inner loop until the exit condition is satisfied for ALL the threads in the warp. As more threads in a warp reach their exit condition, they are disabled by the machine and represent lost processing cycles.
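
To make that concrete, here is a minimal CUDAfy-style sketch (a hypothetical kernel, not from the question; tripCounts and the other names are made up) in which the loop trip count depends on the thread index, so lanes within one warp finish at different times:

[Cudafy]
public static void DivergentKernel(GThread thread, int[] tripCounts, float[] output)
{
    int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

    float sum = 0;
    // Each thread loops a different number of times. Within a warp, the
    // hardware keeps issuing the loop body until the longest-running lane
    // is done; lanes that exited early sit disabled, wasting their cycles.
    for (int i = 0; i < tripCounts[tid]; i++)
    {
        sum += i;
    }
    output[tid] = sum;
}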

In some situations, the lost processing cycles may not impact performance, because disabled threads do not issue memory requests. This is the case if your algorithm is memory bound and the memory that would have been required by the disabled thread was not included in the read done by one of the other threads in the warp. In your case, though, the data is arranged in such a way that accesses are coalesced (which is a good thing), so you do end up losing performance in the disabled threads.
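
As a hedged illustration of that access-pattern difference (again a hypothetical kernel with made-up names): adjacent threads reading adjacent elements coalesce into a few wide memory transactions, while strided reads split into many:

[Cudafy]
public static void AccessPatterns(GThread thread, float[] src, float[] dstA, float[] dstB, int stride)
{
    int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

    // Coalesced: the 32 threads of a warp read 32 consecutive floats,
    // which the hardware services with a small number of wide transactions.
    dstA[tid] = src[tid];

    // Strided: consecutive threads read addresses 'stride' elements apart,
    // so the warp's request splits into many separate transactions.
    dstB[tid] = src[tid * stride];
}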

Your algorithm is very simple and, as it stands, the algorithm does not fit that well on the GPU. However, I think the same calculation can be dramatically sped up on both the CPU and GPU with a different algorithm that uses an approach more like that used in parallel reductions. I have not considered how that might be done in a concrete way though.
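
One possible concrete direction, offered purely as my own sketch rather than anything worked out in this answer: for this particular calculation the inner sum of consecutive differences telescopes to tick[last] - tick[first], so each output only needs the index of the first tick inside the window, which a binary search over the sorted tick array finds with near-uniform work per thread:

// Sketch only: computes the same mean tick spacing without the divergent
// inner loop. Assumes 'tick' is sorted in ascending order.
public static float MeanSpacing(float[] tick, int tickPosition, float lookBkDistance)
{
    float windowStart = tick[tickPosition] - lookBkDistance;

    // Binary search for the first tick inside the look-back window.
    int lo = 0, hi = tickPosition;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (tick[mid] > windowStart) hi = mid; else lo = mid + 1;
    }

    int count = tickPosition - lo;    // Number of gaps in the window.
    if (count == 0) return 0;         // Guard against an empty window.

    // The sum of consecutive differences telescopes:
    // (t[lo+1]-t[lo]) + ... + (t[p]-t[p-1]) = t[p] - t[lo].
    return (tick[tickPosition] - tick[lo]) / count;
}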

A simple thing to try, for a potentially dramatic increase in speed on the CPU, would be to alter your algorithm in such a way that the inner loop iterates forwards instead of backwards. This is because CPUs do cache prefetches. These only work when you iterate forwards through your data.
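
For instance, a sketch under the same assumptions as the question's code, with firstInWindow found as in the binary-search sketch above:

// Sketch: the same window summed forwards so the CPU's hardware
// prefetcher can stream the data.
public static float MeanSpacingForward(float[] tick, int firstInWindow, int tickPosition)
{
    float sum = 0;
    int numberOfTicksInLookBack = 0;
    for (int pos = firstInWindow; pos < tickPosition; pos++)
    {
        sum += tick[pos + 1] - tick[pos];
        numberOfTicksInLookBack++;
    }
    return numberOfTicksInLookBack == 0 ? 0 : sum / numberOfTicksInLookBack;
}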

Roger Dahl
  • Hi Roger, thanks for taking the time to reply. I was hoping there wouldn't be any divergence. Yeah, it was the simplest example I could think of that explained the potential issue. Unfortunately I have more complicated algorithms. Thanks for the info about prefetches; unfortunately it's not possible to implement that improvement. Will keep it in mind though in case it's possible to use later. I guess it's back to evaluating servers or a Phi then. +1 + accepted answer btw :) – Hans Rudel Jun 11 '13 at 15:05
  • Remember that the end goal is not to run a specific algorithm, it is to perform a specific computation. It is often possible to perform the same computation with many different algorithms. There is a good chance that a given computation can be accomplished with an alternate algorithm that fits well on the GPU architecture. Often, this is not the same algorithm as the one you would use on a CPU. – Roger Dahl Jun 11 '13 at 16:25
  • @RogerDahl How did you come to the conclusion that the exit condition in the inner loop is going to differ for different threads? There is no dependence on the threadId, so how could different threads produce different values? – RoBiK Jun 11 '13 at 16:36
  • @RoBiK: I think the code is incorrect. The OP stated that it's an untested example. I just focused on the question and used the code to get the gist of what the OP is trying to do. – Roger Dahl Jun 11 '13 at 16:44
  • @RoBiK" thread at x = 1500 will have to look back 500 places yet the thread at x = 1510 only looks back 300" ie the number of data points each thread looks back is not fixed. Im assuming if they were the same then everything would be executed in sync, but i will have cases where some threads will be executing the outer for loop, while others are still executing the inner for loop. – Hans Rudel Jun 12 '13 at 07:12