44

I am trying to allocate shared memory using a constant parameter, but I am getting an error. My kernel looks like this:

__global__ void Kernel(const int count)
{
    __shared__ int a[count];
}

and I am getting an error saying

error: expression must have a constant value

count is const! Why am I getting this error? And how can I get around this?

Micha Wiedenmann
lina

5 Answers

94

CUDA supports dynamic shared memory allocation. If you define the kernel like this:

__global__ void Kernel(const int count)
{
    extern __shared__ int a[];
}

and then pass the number of bytes required as the third argument of the kernel launch

Kernel<<< gridDim, blockDim, a_size >>>(count);   // a_size is in bytes, e.g. count * sizeof(int)

then it can be sized at run time. Be aware that the runtime only supports a single dynamically declared allocation per block. If you need more, you will need to use pointers to offsets within that single allocation. Also be aware when using pointers that shared memory uses 32 bit words, and all allocations must be 32 bit word aligned, irrespective of the type of the shared memory allocation.
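
For example, a minimal runnable sketch of this approach (the kernel name, launch dimensions and the value of count are just placeholders):

#include <cstdio>

__global__ void Kernel(const int count)
{
    extern __shared__ int a[];   // backed by the bytes passed at launch

    int tid = threadIdx.x;
    if (tid < count)
        a[tid] = tid;            // each thread fills one element
    __syncthreads();

    if (tid == 0)
        printf("a[%d] = %d\n", count - 1, a[count - 1]);
}

int main()
{
    const int count = 64;
    // the third launch argument is the dynamic shared memory size in bytes
    Kernel<<<1, count, count * sizeof(int)>>>(count);
    cudaDeviceSynchronize();
    return 0;
}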

talonmies
    On sm_21 I'm having the impression the `a_size` has 128 bytes granularity: launching Kernel with `a_size = 1` (when needing, say, 16) seems to work OK, and launching with `a_size - 127` for large `a_size` also works OK, while `a_size - 128` shows effects of too small allocation. (I'm reasonably sure "OK" does not mean "by chance".) Could this be related to L1 line size = 128 bytes? – P Marecki Oct 18 '12 at 09:46
  • @talonmies Is it possible to use a template to pass the parameters for the `a` length, or won't it work? – BugShotGG Nov 07 '13 at 12:52
  • @GeoPapas: Yes, as long as the `extern` parameter isn't used. I recommend using templated instances of a kernel rather than dynamically allocated shared memory. The compiler can and will do a much better job of optimising code when array sizes are known at compilation time (see the sketch after these comments). – talonmies Nov 07 '13 at 14:05
  • Thank you. Is this memory automatically freed? – astrowalker Oct 05 '18 at 10:02
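
A minimal sketch of the templated approach mentioned in the comments above (the kernel name and sizes are illustrative, not from the answer itself; the array length becomes a compile-time constant for each instantiation):

template <int COUNT>
__global__ void Kernel()
{
    __shared__ int a[COUNT];     // size is a compile-time constant for each instantiation

    int tid = threadIdx.x;
    if (tid < COUNT)
        a[tid] = tid;
}

int main()
{
    // Each distinct COUNT produces a separate, fully optimised kernel.
    Kernel<64><<<1, 64>>>();
    Kernel<128><<<1, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}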
39

const doesn't mean "constant", it means "read-only".

A constant expression is something whose value is known to the compiler at compile-time.
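
A small standalone illustration of the difference (not from the original question):

__global__ void Kernel(const int count)
{
    // 'count' is read-only inside the kernel, but its value is only known
    // at run time, so it cannot size a statically declared array:
    // __shared__ int a[count];    // error: expression must have a constant value

    const int N = 128;             // a genuine constant expression
    __shared__ int b[N];           // OK: N is known at compile time

    if (threadIdx.x < N)
        b[threadIdx.x] = count;
}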

Oliver Charlesworth
21

Option one: declare the shared memory with a compile-time constant size (not the same as const):

__global__ void Kernel(int count_a, int count_b)
{
    __shared__ int a[100];
    __shared__ int b[4];
}

Option two: declare the shared memory dynamically via the kernel launch configuration:

__global__ void Kernel(int count_a, int count_b)
{
    extern __shared__ int shared[];
    int *a = &shared[0];        // a starts at the beginning of the dynamic allocation
    int *b = &shared[count_a];  // b starts immediately after the end of a
}

size_t sharedMemory = count_a*sizeof(int) + count_b*sizeof(int);
Kernel <<<numBlocks, threadsPerBlock, sharedMemory>>> (count_a, count_b);

Note: pointers into dynamically allocated shared memory all start from the same base address. I use two shared memory arrays to illustrate how to manually lay out two arrays within that single allocation.
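
If the two arrays have different element types, the same manual layout still works. Here is a hedged sketch (types and counts are arbitrary) that carves one dynamic allocation into an int array followed by a float array:

__global__ void Kernel(int count_a, int count_b)
{
    extern __shared__ unsigned char smem[];   // the single dynamic allocation per block

    int   *a = reinterpret_cast<int *>(smem);
    float *b = reinterpret_cast<float *>(smem + count_a * sizeof(int));

    if (threadIdx.x < count_a) a[threadIdx.x] = threadIdx.x;
    if (threadIdx.x < count_b) b[threadIdx.x] = 1.0f;
}

// launch: Kernel<<<numBlocks, threadsPerBlock, count_a*sizeof(int) + count_b*sizeof(float)>>>(count_a, count_b);

When mixing types of different sizes (for example double and int), place the more strictly aligned array first, or round its byte offset up to a multiple of its alignment, so each sub-array stays properly aligned.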

jmilloy
    Option two won't work unless the shared memory allocation is defined with the extern keyword in the kernel. – talonmies Apr 05 '11 at 18:27
  • @talonmies fixed! we should also mention that this means that if you have two kernels with shared memory, you should use different identifiers for the shared memory pointer. it's not completely necessary but can help avoid some errors. – jmilloy Apr 05 '11 at 23:31
  • @talonmies nvcc wouldn't compile when i had `extern __shared__ int *shared` in one kernel and `extern __shared__ float *shared` in another kernel – jmilloy Apr 06 '11 at 17:32
6

From the "CUDA C Programming Guide":

The execution configuration is specified by inserting an expression of the form:

<<<Dg, Db, Ns, S>>>

where:

  • Dg is of type dim3 and specifies the dimension and size of the grid ...
  • Db is of type dim3 and specifies the dimension and size of each block ...
  • Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block for this call, in addition to the statically allocated memory. This dynamically allocated memory is used by any of the variables declared as an external array, as described in __shared__; Ns is an optional argument which defaults to 0.
  • S is of type cudaStream_t and specifies the associated stream ...

So by using the dynamic parameter Ns, the user can specify at launch time the total size of shared memory a kernel can use, no matter how many shared variables are declared in that kernel.
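
For illustration, a launch that spells out all four configuration arguments might look like this (the kernel body and the sizes are placeholders):

__global__ void Kernel(int count)
{
    extern __shared__ int a[];     // backed by the Ns bytes below
    if (threadIdx.x < count)
        a[threadIdx.x] = threadIdx.x;
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 Dg(16);                    // grid dimensions
    dim3 Db(256);                   // block dimensions
    size_t Ns = 256 * sizeof(int);  // dynamic shared memory per block, in bytes

    Kernel<<<Dg, Db, Ns, stream>>>(256);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}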

Pika Supports Ukraine
smh
1

You cannot declare a shared variable like this:

__shared__ int a[count];

although if you are sure about the maximum size of array a, then you can declare it directly:

__shared__ int a[100];

but in this case you should be concerned about how many blocks there are in your program, since reserving shared memory for a block (and not utilising it fully) will lead to context switching with global memory (high latency), and thus poor performance...

There is a nicer solution to this problem: declare

extern __shared__ int a[];

and allocate the memory when calling the kernel from the host, like

Kernel<<< gridDim, blockDim, a_size >>>(count);

but you should also be careful here, because if you use more shared memory in a block than you allocate at the kernel launch, you are going to get unexpected results.
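
One way to avoid requesting more shared memory than the device allows (a sketch; the requested size is a placeholder) is to query the per-block limit before choosing the launch configuration:

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // query device 0

    size_t requested = 48 * 1024;         // bytes of dynamic shared memory you intend to ask for
    printf("sharedMemPerBlock = %zu bytes\n", prop.sharedMemPerBlock);

    if (requested > prop.sharedMemPerBlock)
        printf("Requested %zu bytes exceeds the per-block limit; the launch would fail.\n", requested);

    return 0;
}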

talonmies
peeyush
  • why first is nonsense? can we declare shared memory using some variable? – peeyush Apr 06 '11 at 13:30
  • It is nonsense because (a) shared memory is, by definition, declared at block scope and every allocation is tied to a block, and (b) there is no such thing as "context switching with global memory" in CUDA. – talonmies Apr 06 '11 at 15:19
  • what if i have 16kb shared memory per SM and i am using 4kb of shared memory (extern shared memory), now if one block is scheduled on sm, then it uses 4kb of shared memory, but it's not done, okay now another 7 block get scheduled on shared memory, while each is using 4kb , so how this is going to be handled? how cuda will manage 4kb of shared mem, while there are 8 blocks to be scheduled? I agree that sm is at block scope, by shared int[count], i had objection over count that is variable, since variable can not be used in declaring shared memory.. am i wrong here? – peeyush Apr 07 '11 at 19:48
  • The GPU only ever schedules as many blocks as will fit into the available register and shared memory space. If a block requires 4kb of shared memory, then at most 4 blocks will be active per SM if the SM has 16kb of shared memory. I guess I should have said your first paragraph is "mostly nonsense", because yes the first 10 words are correct, but it descends into fantasy after that. – talonmies Apr 07 '11 at 19:59