
I have a question about warps on the GPU.

I used the following configuration:

  • GeForce 210
  • CUDA capability major/minor : 1.2
  • 2 multiprocessors, 8 CUDA Cores/MP : 16 CUDA Cores
  • Warp size : 32

Below are the running times (measured with Nsight):

blocks, threads/block : time
----------------------------
 1, 32 : 5.1
 8, 32 : 5.4
16, 32 : 5.7
32, 32 : 8.9
64, 32 : 14.8

Warps (= 32 threads) run concurrently, and there are 2 MPs, so I thought 64 threads was the maximum capacity of this GPU. Yet 16*32 threads run in almost the same time as 32 threads. Even considering the warp scheduler, I can't understand this result.

My questions are:

  1. Why do 16*32 threads run in almost the same time as 32 threads?
  2. Why isn't the running time for 64*32 twice that of 32*32?
  3. I heard that global memory access is as fast as register access. Is that right (including compute capability 3.5 or high-end GPUs)?
proxiajd
  • What is your code actually doing? Concerning your first question, you don't mention the units of the time: is 5.1 microseconds, milliseconds, seconds? You might actually be seeing just kernel launch overhead if you are not giving your GPU enough work to do. Concerning your second question, in general code performance does not simply scale with the number of threads, for several reasons, and one of them might be the same as for the first question. Regarding your third question, where have you read this? – Vitality May 07 '14 at 20:31

2 Answers


The GPU is a latency-hiding architecture. The execution units are pipelined; the depth of the pipeline is not disclosed. For this answer, assume the device can issue 1 instruction per cycle and that the dependent-instruction latency is 8 cycles. Under those assumptions, about 8 warps per scheduler would be enough to keep the pipeline full.

Assume a really simple program that has dependencies between instructions:

1. ADD     R0, R1, R2
2. ADD     R3, R1, R2
3. ADD     R0, R3, R4   read r3 after write r3
4. LD      R1, R0       read r0 after write r0
5. ADD     R1, R1, R2   read r1 after write r1

time in cycles -->
                0                                4
                0        1         2         3   0
                123456789012345678901234567890...01234567890
                --------------------------------------------
warp 0 issues   12.......3.......4............>>>5..........
warp 0 retires  ........12.......3............>>>4.......5..

The graph shows at which cycle warp 0 issues each instruction and at which cycle each instruction retires. There is a discontinuity of 370 cycles on the timeline (marked >>>) to cover the latency of the global memory access, which can be 200-1000 cycles.

If you add more warps, those warps can issue at any point on the timeline where warp 0's issue row shows a '.' (an idle issue slot).

Your kernel will scale with almost no increase in time until the warp scheduler has sufficient warps to issue on every cycle. Once this threshold is hit, the warp scheduler is oversubscribed and execution time will increase. Execution time can also increase through heavier use of the math pipes or the memory subsystem.
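
As a rough illustration of this scaling behavior, here is the kind of micro-benchmark that could produce a table like the one in the question. This is a sketch, not the asker's actual code; the kernel name, iteration count, and timing harness are all assumptions. Each thread runs a chain of dependent FMAs, so a lone warp is limited by pipeline latency, and adding warps fills the idle issue slots:

#include <cstdio>
#include <cuda_runtime.h>

// Each iteration depends on the previous one, so a single warp
// stalls on the pipeline latency between dependent instructions.
__global__ void dependent_chain(float *out, int iters)
{
    float v = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;      // dependent FMA chain
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    const int threadsPerBlock = 32;    // one warp per block
    float *d_out;
    cudaMalloc(&d_out, 64 * threadsPerBlock * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Double the block count each step and time the launch.
    for (int blocks = 1; blocks <= 64; blocks *= 2) {
        cudaEventRecord(start);
        dependent_chain<<<blocks, threadsPerBlock>>>(d_out, 100000);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%2d blocks x %d threads: %.3f ms\n",
               blocks, threadsPerBlock, ms);
    }

    cudaFree(d_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Until the schedulers run out of idle issue slots, the measured times should stay nearly flat as the block count grows, matching the shape of the asker's table.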

If you are working on a Fermi or newer GPU, you can use the Nsight VSE CUDA Profiler's Issue Efficiency experiment to see how increasing the number of blocks/warps/threads affects the scheduler's efficiency, and you can also inspect the reasons that warps are stalled.

Greg Smith

For your first 2 questions, please verify the GPU specifications. The result also depends on your code: the speedup you get varies with the algorithm being implemented, and with the extent to which the algorithm can be parallelized compared to its sequential counterpart.

For the 3rd question: no. Global memory accesses are much slower than accesses to registers or shared memory. That is the reason we use shared-memory optimizations. The rule of thumb is: if something in global memory is accessed multiple times, it is better to fetch it from global memory only once and keep it in shared memory or private variables, as in the sketch below.
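
As a minimal sketch of that rule of thumb (a hypothetical kernel, not part of this answer's original text): each thread reads its element from global memory exactly once into shared memory, and every later access hits the fast on-chip copy. Block-boundary halos are ignored to keep the example short:

// 5-point moving average; launch with 256 threads per block.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[256];        // one element per thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // One global-memory read per element...
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // ...then every reuse reads shared memory instead of DRAM.
    // (Neighbors outside this block are simply skipped.)
    float sum = 0.0f;
    int count = 0;
    for (int k = -2; k <= 2; ++k) {
        int j = (int)threadIdx.x + k;
        if (j >= 0 && j < (int)blockDim.x) {
            sum += tile[j];
            ++count;
        }
    }
    if (i < n)
        out[i] = sum / count;
}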

Yogi Joshi