The GPU is a latency-hiding architecture. The execution units are pipelined, but the depth of the pipeline is not disclosed. For this answer, let's assume the device can issue 1 instruction per cycle and that the dependent-instruction latency is 8 cycles.
Assume a really simple program that has dependencies between instructions:
1. ADD R0, R1, R2
2. ADD R3, R1, R2
3. ADD R0, R3, R4   (read R3 after write R3 by instruction 2)
4. LD  R1, R0       (read R0 after write R0 by instruction 3)
5. ADD R1, R1, R2   (read R1 after write R1 by the load in instruction 4)
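
To connect this to source code, here is a minimal CUDA sketch (a hypothetical kernel of my own, not taken from the question) with the same read-after-write structure. The compiler is free to reorder or eliminate instructions, so the SASS it actually emits will not match the listing line for line:

// Hypothetical kernel that mirrors the 5-instruction chain above.
// Instruction 1's result is overwritten before it is read, so a real
// compiler may drop it; this only illustrates the dependencies.
__global__ void dep_chain(const int* __restrict__ table, int* out,
                          int r1, int r2, int r4)
{
    int r0 = r1 + r2;    // 1. ADD: independent
    int r3 = r1 + r2;    // 2. ADD: independent
    r0 = r3 + r4;        // 3. ADD: waits ~8 cycles on r3 from 2
    int v  = table[r0];  // 4. LD:  waits on r0 from 3, then 200-1000 cycles for memory
    out[blockIdx.x * blockDim.x + threadIdx.x] = v + r2;  // 5. ADD: waits on the load
}

With the assumed 8-cycle dependent latency, the chain plays out like this for a single warp: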
time in cycles -->
                                                4         4
                        1         2         3   0         1
               123456789012345678901234567890...01234567890
               --------------------------------------------
warp 0 issues  12.......3.......4............>>>5..........
warp 0 retires ........12.......3............>>>4.......5..
The graph shows at which cycle warp 0 issues each instruction and at which cycle that instruction retires. There is a 370-cycle discontinuity in the timeline (the >>> marks) to cover the latency of the global memory access, which can be 200-1000 cycles.
If you add more warps, those warps can issue in any cycle where the warp 0 issue row shows a '.', that is, in any cycle where warp 0 is stalled on a dependency.
Your kernel will scale with almost no increase in execution time until the warp scheduler has enough warps to issue an instruction every cycle; in this simplified model that is on the order of the dependent latency times the issue rate, so roughly 8 warps per scheduler for the ALU dependencies, while the memory latency takes far more warps (or more independent work per warp) to cover. Once this threshold is hit the warp scheduler is oversubscribed and execution time will increase. Execution time can also increase through heavier use of the math pipes or the memory subsystem.
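
One rough way to see where that threshold falls on your own device is to time the same launch at increasing block counts. Below is a minimal sketch, assuming the hypothetical dep_chain kernel from above is compiled in the same file, using cudaEvent timers (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

// Forward declaration of the hypothetical dep_chain kernel sketched earlier.
__global__ void dep_chain(const int* __restrict__ table, int* out,
                          int r1, int r2, int r4);

int main()
{
    const int threadsPerBlock = 128;
    const int maxBlocks = 8192;
    const size_t bytes = (size_t)maxBlocks * threadsPerBlock * sizeof(int);

    int *table, *out;
    cudaMalloc((void**)&table, bytes);
    cudaMalloc((void**)&out, bytes);
    cudaMemset(table, 0, bytes);

    // Warm-up launch so the first timed launch does not include init overhead.
    dep_chain<<<1, threadsPerBlock>>>(table, out, 1, 2, 3);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the kernel at increasing block counts. Expect the time to stay
    // nearly flat until the schedulers have enough warps to issue every
    // cycle, then rise once the schedulers or the memory system saturate.
    for (int blocks = 1; blocks <= maxBlocks; blocks *= 2) {
        cudaEventRecord(start);
        dep_chain<<<blocks, threadsPerBlock>>>(table, out, 1, 2, 3);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("blocks=%5d  time=%.3f ms\n", blocks, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(table);
    cudaFree(out);
    return 0;
}

The exact point where the curve turns upward depends on the number of SMs, schedulers per SM, and the memory system, so treat the output as a qualitative picture rather than a measurement of the 8-cycle latency.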
If you are working on a Fermi or newer GPU, you can use the Nsight VSE CUDA Profiler Issue Efficiency experiment to see how increasing the number of blocks/warps/threads affects the scheduler's efficiency, and you can also inspect the reasons that warps are stalled.