I wrote a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it against cudaMemcpy(). It looks like this:
template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);
    // This copy kernel is only correct when size % sizeof(vector_type) == 0.
    auto numElements = size / sizeof(vector_type);
    // Grid-stride loop: thread id copies elements id, id + stride, id + 2*stride, ...
    for (auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements; id += gridDim.x * THREADS_PER_BLOCK) {
        dest2[id] = src2[id];
    }
}
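For reference, a minimal launch-and-timing harness around this kernel could look like the sketch below. The buffer setup, event-based timing, and main() scaffolding here are illustrative assumptions, not the exact benchmark code:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // 400 MB; divisible by sizeof(int2), as the kernel requires.
    const unsigned int size = 400u * 1000u * 1000u;
    void *src, *dest;
    cudaMalloc(&src, size);
    cudaMalloc(&dest, size);

    constexpr unsigned int THREADS_PER_BLOCK = 256;
    const unsigned int numElements = size / sizeof(int2);
    // One element per thread: ceil(50e6 / 256) = 195313 blocks.
    const unsigned int NUM_BLOCKS = (numElements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy<THREADS_PER_BLOCK><<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(src, dest, size);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each byte is read once and written once, hence the factor of 2.
    printf("my memcpy finished in %fms. Bandwidth: %f GB/s\n", ms, 2.0 * size / (ms * 1e6));

    cudaFree(src);
    cudaFree(dest);
    return 0;
}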
I also calculated the number of blocks required to reach 100% occupancy like so:
THREADS_PER_BLOCK = 256
Multi-Processors: 4
Max Threads per Multiprocessor: 2048
NUM_BLOCKS = 4 * 2048 / 256 = 32
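The same number can also be derived at runtime instead of hard-coding the device limits, e.g. with the occupancy API. This is only a sketch; it assumes nothing else (such as register usage) limits the occupancy of this kernel:

int numSMs = 0, blocksPerSM = 0;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
// Resident blocks of copy<256> per SM; 2048 / 256 = 8 if nothing else limits it.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, copy<256>, 256, 0);
const int NUM_BLOCKS = numSMs * blocksPerSM;  // 4 * 8 = 32 on the GTX 760M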
My tests, on the other hand, showed that launching enough blocks so that each thread processes exactly one element consistently outperformed the "optimal" block count. Here are the timings for 400 MB of data:
bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
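(The bandwidth figures count each copied byte twice, one read plus one write: 2 × 400 MB / 15.63 ms ≈ 51.2 GB/s.)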
So my questions are:
Why is there a speed difference?
Are there any downsides to launching one thread per element, when each element can be processed completely independently of all the others?