
I have a simple CUDA application with the following code:

#include <stdio.h>
#include <string.h>     // for memcpy
#include <sys/time.h>
#include <stdint.h>
__global__
void daxpy(int n, int a, int *x, int *y) {  // 'a' is unused; the loop below is just busywork for timing
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= n) return;                       // guard against out-of-range threads
  y[i] = x[i];
  int j;
  for(j = 0; j < 1024*10000; ++j) {
     y[i] += j%10;
  }
}
// debug time
void calc_time(struct timeval *start, const char *msg) {
   struct timeval end;
   gettimeofday(&end, NULL);
   uint64_t us = (uint64_t)end.tv_sec * 1000000 + end.tv_usec
               - ((uint64_t)start->tv_sec * 1000000 + start->tv_usec);
   printf("%s cost us = %llu\n", msg, (unsigned long long)us);
   memcpy(start, &end, sizeof(struct timeval));
}
void do_test() {
   unsigned long n = 1536;
   int *x, *y, a, *dx, *dy;
   a = 2;
   x = (int*)malloc(sizeof(int)*n);
   y = (int*)malloc(sizeof(int)*n);
   for(unsigned long i = 0; i < n; ++i) {
      x[i] = i;
   }

   cudaMalloc((void**)&dx, n*sizeof(int));
   cudaMalloc((void**)&dy, n*sizeof(int));
   struct timeval start;
   gettimeofday(&start, NULL);
   cudaMemcpy(dx, x, n*sizeof(int), cudaMemcpyHostToDevice);

   daxpy<<<1, 512>>>(n, a, dx, dy); // the kernel launch under test
   cudaThreadSynchronize();         // deprecated; cudaDeviceSynchronize() is the modern equivalent
   cudaMemcpy(y, dy, n*sizeof(int), cudaMemcpyDeviceToHost);
   calc_time(&start, "do_test ");
   cudaFree(dx);
   cudaFree(dy);
   free(x);
   free(y);
}
int main() {
   do_test();
   return 0;
}

The GPU kernel launch is daxpy<<<1, 512>>>(n, a, dx, dy), and I ran some tests with different block sizes:

  • daxpy<<<1, 32>>>(n, a, dx, dy)
  • daxpy<<<1, 64>>>(n, a, dx, dy)
  • daxpy<<<1, 128>>>(n, a, dx, dy)
  • daxpy<<<1, 129>>>(n, a, dx, dy)
  • daxpy<<<1, 512>>>(n, a, dx, dy)

... and made the following observations:

  • Execution time is the same for block sizes 32, 64, and 128,
  • Execution time differs between block sizes 128 and 129, in particular:
    • For 128 the execution time is 280 ms,
    • For 129 the execution time is 386 ms.

I would like to ask: what is causing the difference in execution time between block sizes 128 and 129?
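
Note that the interval measured in do_test spans the two cudaMemcpy calls as well as the kernel. For reference, a CUDA-event based measurement around just the launch would look roughly like the sketch below (this is not what produced the numbers above, just the usual way to isolate kernel time):

   cudaEvent_t ev_start, ev_stop;
   cudaEventCreate(&ev_start);
   cudaEventCreate(&ev_stop);

   cudaEventRecord(ev_start);
   daxpy<<<1, 512>>>(n, a, dx, dy);
   cudaEventRecord(ev_stop);
   cudaEventSynchronize(ev_stop);                // wait for the kernel to finish

   float ms = 0.0f;
   cudaEventElapsedTime(&ms, ev_start, ev_stop); // elapsed GPU time in milliseconds
   printf("kernel only = %.3f ms\n", ms);

   cudaEventDestroy(ev_start);
   cudaEventDestroy(ev_stop);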

My GPU is a Tesla K80:

CUDA Driver Version / Runtime Version          6.5 / 6.5
CUDA Capability Major/Minor version number:    3.7
Total amount of global memory:                 11520 MBytes (12079136768 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
GPU Clock rate:                                824 MHz (0.82 GHz)
Memory Clock rate:                             2505 Mhz
Memory Bus Width:                              384-bit
L2 Cache Size:                                 1572864 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Enabled
Device supports Unified Addressing (UVA):      Yes
Device PCI Bus ID / PCI location ID:           135 / 0
ccwenlin
  • Could you be a bit more specific about what "slower" means? – Michal Hosala May 06 '15 at 14:25
  • You're giving the GPU more work to do. It makes sense that it takes longer. Your kernels are all composed of a single threadblock, meaning they will all run on a single SM. At 128 threads, you have 4 warps which will more fully occupy the single SM, causing the code to take longer. – Robert Crovella May 06 '15 at 14:29
  • @RobertCrovella I agree with you, however, I would personally expect the time difference to be _very_ small, if any. Execution should not take longer for 128 threads (compared to 64) as the SMX has 192 cores. On top of that, the K80 features a quad warp scheduler, so there should be no overhead related to warp scheduling either. So where does the time difference actually come from? – Michal Hosala May 06 '15 at 14:41
  • @MichalHosala the 192 cores are SP floating point cores. This is an integer kernel. Furthermore, I don't know what the rate of a modulo (`%`) operation is, but I suspect it is not full integer throughput. I consider your thesis to be flawed, and I don't want to have a sidebar discussion in the comments about it. This question itself is poorly posed in my opinion, if for no other reason than the one you yourself pointed out. If you wish to ask a question, please ask a new, proper SO question. – Robert Crovella May 06 '15 at 15:30
  • @RobertCrovella But why does it cost the same time when using 32 threads and 64 threads? – ccwenlin May 07 '15 at 01:42
  • The GPU has many rate limiters (throughput restrictions) at various places in the architecture. It seems that in going from 32 threads in a single block to 64 threads in a single block, you've not hit any of the rate limiters (throughput limits). Going from 64 to 128 you probably have. You've not shown a complete code that someone else could test, nor indicated what your timing measurements actually are, so this is just speculation. – Robert Crovella May 07 '15 at 01:48
  • Also, although I'm not sure since your example here is far from complete, but if the *only* thing you are changing is the threads per block in the kernel launch, then your different cases are doing differing amounts of work. Since the 32 threads per block case is doing half the work (but with half the threads) of the 64 threads per block case, this may explain why those timings are the same. – Robert Crovella May 07 '15 at 02:00
  • @RobertCrovella Sorry for not posting the full code. All the code is here now. – ccwenlin May 07 '15 at 02:22
  • @MichalHosala Sorry, I did not explain it clearly. It is not at 128 that it becomes slower; it starts from 129. The time from 1 to 128 threads is about 280 ms, but it becomes 386 ms when using 129 threads. – ccwenlin May 07 '15 at 02:38
  • @ccwenlin I updated the question based on the numbers you gave us in the above comment. Is the answer I provided of some use to you? Or do you need something else? – Michal Hosala May 09 '15 at 08:44

1 Answer

Now that you have provided the exact timings in one of the comments, i.e.:

  • 280 ms for up to 128 threads,
  • 386 ms for 129+ threads,

I think this indirectly supports my theory that the issue is related to warp scheduling. Have a look at the GK210 whitepaper (GK210 is the chip used in the K80):

  • The K80 SMX features a quad warp scheduler, see the section Quad Warp Scheduler,
  • This means that a K80 SMX can schedule up to 128 threads at once (4 warps == 128 threads), and these are then executed simultaneously,

Therefore, with 129 threads, scheduling cannot happen in a single pass, because the SMX has to schedule 5 warps, i.e. scheduling happens in two steps.
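
To make the arithmetic explicit: a block is split into warps of 32 threads, and the quad warp scheduler described above issues up to 4 warps at once, so the number of scheduling passes is a simple ceiling division. A tiny host-side sketch (illustration only, not part of the program above):

#include <stdio.h>

int main(void) {
   int sizes[] = {32, 64, 128, 129, 192};
   for (int k = 0; k < 5; ++k) {
      int threads = sizes[k];
      int warps   = (threads + 31) / 32;  // warps per block (warp size = 32)
      int passes  = (warps + 3) / 4;      // issue groups for a quad warp scheduler
      printf("threads=%3d  warps=%d  passes=%d\n", threads, warps, passes);
   }
   return 0;
}

For 128 threads this gives 4 warps and a single pass; for 129 threads it gives 5 warps and two passes.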

If the above is true, then I would expect:

  • The execution time to be roughly the same for block sizes 1 - 128,
  • The execution time to be roughly the same for block sizes 129 - 192.

192 is the number of cores on the SMX, see the whitepaper. As a reminder, an entire block is always scheduled onto a single SMX, so if you spawn more than 192 threads they certainly cannot all execute in parallel, and the execution time should rise again for 193+ threads.

You can verify this thesis by simplifying your kernel to the point where it does almost nothing; then it should be fairly obvious whether the execution takes longer due to scheduling alone, since there will be no other limiting factors such as memory throughput.
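
For instance, a stripped-down kernel along these lines (the name almost_nothing is made up, any trivial body would do) keeps the launch configuration identical while removing the loop, the modulo, and most of the memory traffic:

__global__
void almost_nothing(int *y) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   y[i] = i;   // a single write, no loop, no modulo
}

// launched just like the original kernel, e.g.:
//   almost_nothing<<<1, 128>>>(dy);
//   almost_nothing<<<1, 129>>>(dy);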

Disclaimer: the above is just my assumption, as I do not have access to a K80, nor to any other GPU with a quad warp scheduler, so I cannot profile your code properly. In any case, I believe that is a task for you: why not use nvprof and profile your code yourself? Then you should be able to see where the time difference lies.
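
For example, assuming the program is built into an executable named ./a.out (the name is just a placeholder), running nvprof ./a.out lists the time spent in the daxpy kernel separately from the cudaMemcpy transfers, which would show directly whether the extra ~100 ms is spent inside the kernel.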

Michal Hosala