
I just noticed that my CUDA kernel uses exactly twice the shared memory that theory predicts. For example:

__global__ void foo()
{
    __shared__ double t;
    t = 1;
}

The ptxas output shows:

ptxas info    : Function properties for _Z3foov, 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads  
ptxas info    : Used 4 registers, 16 bytes smem, 32 bytes cmem[0]  

But a double is only 8 bytes, so I would expect 8 bytes of smem, not 16.

Another example:

__global__ void foo()
{
    __shared__ int t[1024];
    t[0] = 1;
}

gives

ptxas info    : Used 3 registers, 8192 bytes smem, 32 bytes cmem[0]
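
That is 1024 × sizeof(int) = 4096 bytes of data, yet ptxas again reports exactly double.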

Could someone explain why?

  • @Fr34K: How can run time settings possibly have anything to do with the size of a *static assignment* made by the compiler and assembler? – talonmies Nov 10 '12 at 07:24
  • Need more clarification. How many blocks have you initialized? – Fr34K Nov 10 '12 at 08:06
  • @talonmies: Thanks for the clarification. I missed that point. :) – Fr34K Nov 10 '12 at 08:07
  • Honestly, I've never used statically allocated shared memory in CUDA kernels, but this looks strange. Which CUDA version do you have? Is it for 64 bits? – asm Nov 10 '12 at 08:32
  • If I compile your first kernel with the 4.2 release of nvcc, I only get 8 bytes of static shared memory for both compute 1.3 and compute 2.0 targets (i.e. both the old Open64 and the new LLVM-based compilers). What platform and CUDA version are you using for this? – talonmies Nov 10 '12 at 10:13
  • @asm I am using CUDA 5.0 on Ubuntu 12.04 64-bit. Actually, I just noticed that the **visual profiler** reports the correct size of shared memory, while **Nsight Eclipse Edition / nvcc** reports twice the actual size. I suspect this is a bug in the nvcc compiler. – Rainn Nov 10 '12 at 16:22
  • @talonmies I am using CUDA 5 for compute capability 2.1 (GeForce GTX 560). – Rainn Nov 10 '12 at 16:24
  • @Rainn Dump out the PTX and see if it is doing something unexpected? – Pavan Yalamanchili Feb 13 '13 at 00:07 (see the sketch after these comments)
  • Is this kernel the only thing you are compiling? Can you provide a complete .cu file that reproduces this result? – harrism Feb 28 '13 at 02:56
  • Rainn, if you don't provide enough information to reproduce this, I'll vote to close it... – harrism Apr 08 '13 at 12:54
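
Following Pavan's suggestion, a minimal way to inspect the PTX (the shmem.cu file name is illustrative):

nvcc -arch=sm_21 -ptx shmem.cu -o shmem.ptx

For the second kernel above, the PTX should contain a shared declaration along these lines (the exact alignment and mangled name may differ):

.shared .align 4 .b8 _ZZ3foovE1t[4096];

If the PTX already declares 4096 bytes but ptxas still reports 8192, the doubling happens in the assembler or its reporting rather than in the compiler front end.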

1 Answer

The problem seems to be gone in the current CUDA compiler.

__shared__ int a[1024];

compiled with the command

nvcc -m64 -Xptxas -v -ccbin /opt/gcc-4.6.3/bin/g++-4.6.3 shmem.cu

gives

ptxas info    : Used 1 registers, 4112 bytes smem, 4 bytes cmem[1]

There is some shared memory overhead in this case (4112 − 1024 × 4 = 16 bytes), but the usage is no longer doubled.
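
Independent of the ptxas printout, the runtime can also report the static shared memory actually reserved for a kernel. A minimal sketch using the standard cudaFuncGetAttributes runtime API (the kernel mirrors the second example from the question):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo()
{
    __shared__ int t[1024];
    t[0] = 1;
}

int main()
{
    cudaFuncAttributes attr;
    // sharedSizeBytes is the static shared memory reserved per block
    // for this kernel, as seen by the driver at run time.
    cudaFuncGetAttributes(&attr, foo);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}

This gives a run-time cross-check against the -Xptxas -v figures.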

  • In the question, the compiler is likely removing t since the variable has no side effect. The 16 bytes are likely because you are compiling for a compute_1x architecture. The answer above is also compiling for compute_10. If you compile to SASS then you can use cuobjdump to inspect each shared memory allocation to determine how the size is calculated. – Greg Smith Apr 12 '13 at 00:51 (a sketch follows this comment)
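
A sketch of Greg's cuobjdump approach. The architecture flag and file names are illustrative, and the available cuobjdump options vary between toolkit versions (check cuobjdump --help):

nvcc -arch=sm_21 -cubin shmem.cu -o shmem.cubin
cuobjdump -elf shmem.cubin

The ELF dump lists the shared memory symbols together with their sizes, which shows whether the extra bytes belong to the variable itself or to compiler-generated overhead.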