About stack and heap
The stack is allocated per thread and has a hardware limit (see below).
The heap resides in global memory, can be allocated using malloc(), and must be explicitly freed using free() (CUDA doc).
These device functions:
void* malloc(size_t size);
void free(void* ptr);
can be useful, but I would recommend using them only when they are really needed. A better approach is usually to rethink the code so that the memory is allocated with the host-side functions (such as cudaMalloc).
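As a rough illustration of the host-side alternative, here is a minimal sketch in which the host allocates one scratch buffer with cudaMalloc and each thread works on its own slice, so no malloc() is needed inside the kernel. The kernel name fillScratch, the helper runScratchDemo and the launch sizes are hypothetical and chosen only for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread uses its own slice of a scratch buffer
// that the host allocated up front, instead of calling malloc() itself.
__global__ void fillScratch(int* scratch, int itemsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int* mine = scratch + tid * itemsPerThread;
    for (int i = 0; i < itemsPerThread; ++i)
        mine[i] = tid;
}

// Returns 0 on success, 1 on any CUDA error.
int runScratchDemo()
{
    const int threads = 64, blocks = 4, itemsPerThread = 16;
    const size_t bytes = sizeof(int) * threads * blocks * itemsPerThread;

    int* scratch = nullptr;
    if (cudaMalloc(&scratch, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fillScratch<<<blocks, threads>>>(scratch, itemsPerThread);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(scratch);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    return runScratchDemo();
}
```

The total size is known before the launch here, which is the case where host-side allocation fits best; in-kernel malloc() remains an option when the size really is only known per thread at run time.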
The stack size has a hardware limit, which can be computed (according to this answer by @njuffa) as the minimum of:
- amount of local memory per thread
- available GPU memory / number of SMs / maximum resident threads per SM
Since you are increasing the stack size while running only one thread, I guess your problem is the second limit, which in your case (Tesla M2090) should be: 6144 MB / 16 / 512 = 0.75 MB (768 KB).
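The rule above can be sketched as a small host program that reads the figures from the device at run time. The helper stackLimitBytes is a hypothetical name, and the 512 KB local-memory-per-thread constant is an assumption taken from the compute capability tables (it is not exposed by cudaDeviceProp), so check it for your device.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on the per-thread stack size, computed as
// min(local memory per thread, GPU memory / number of SMs / threads per SM).
size_t stackLimitBytes(size_t localMemPerThread, size_t gpuMemBytes,
                       int numSMs, int maxThreadsPerSM)
{
    size_t perThreadShare = gpuMemBytes / numSMs / maxThreadsPerSM;
    return std::min(localMemPerThread, perThreadShare);
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    size_t freeMem = 0, totalMem = 0;
    if (cudaMemGetInfo(&freeMem, &totalMem) != cudaSuccess) return 1;

    // ASSUMPTION: 512 KB of local memory per thread (compute capability 2.x
    // and later according to the programming guide tables).
    const size_t localMemPerThread = 512 * 1024;

    size_t limit = stackLimitBytes(localMemPerThread, freeMem,
                                   prop.multiProcessorCount,
                                   prop.maxThreadsPerMultiProcessor);

    size_t current = 0;
    cudaDeviceGetLimit(&current, cudaLimitStackSize);

    printf("current stack size per thread: %zu bytes\n", current);
    printf("estimated upper bound:         %zu bytes\n", limit);
    return 0;
}
```

This only gives an estimate: the memory actually available depends on what else is allocated in the context, so a cudaDeviceSetLimit(cudaLimitStackSize, ...) call close to the estimate can still fail and its return value should be checked.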
The heap has a fixed size (default 8 MB) that must be specified, by calling cudaDeviceSetLimit with cudaLimitMallocHeapSize, before any kernel that calls malloc() is launched. Be aware that the heap actually reserved will be at least the size requested, because of some allocation overhead.
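A minimal sketch of setting the heap size before the first kernel that uses malloc() could look like this. The kernel allocInKernel, the helper runHeapDemo and the 64 MB figure are hypothetical; in-kernel malloc() requires compute capability 2.0 or higher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread allocates, touches and frees a block
// taken from the device heap.
__global__ void allocInKernel(size_t bytesPerThread)
{
    char* p = static_cast<char*>(malloc(bytesPerThread));
    if (p == nullptr) {  // malloc() returns NULL when the heap is exhausted
        printf("thread %d: malloc failed\n", threadIdx.x);
        return;
    }
    p[0] = 1;
    free(p);
}

// Returns 0 on success, 1 on any CUDA error.
int runHeapDemo(size_t heapBytes)
{
    // Set the heap size before launching any kernel that calls malloc().
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes) != cudaSuccess)
        return 1;

    size_t granted = 0;
    if (cudaDeviceGetLimit(&granted, cudaLimitMallocHeapSize) != cudaSuccess)
        return 1;
    printf("requested %zu bytes, heap limit is now %zu bytes\n",
           heapBytes, granted);

    allocInKernel<<<1, 32>>>(1024);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}

int main()
{
    return runHeapDemo(64u * 1024 * 1024);  // 64 MB, an arbitrary example value
}
```

Reading the limit back with cudaDeviceGetLimit shows the size that was actually granted, which may differ from the requested value.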
It is also worth mentioning that the heap limit is not per-thread: the heap is shared by all threads, and memory allocated from it remains allocated for the lifetime of the CUDA context (or until it is released by a call to free()), so it can be used by other threads, even in a subsequent kernel launch.
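To illustrate this lifetime, here is a minimal sketch in which one kernel allocates a buffer per block, a second kernel writes to it, and a third frees it. The names blockBuffers, allocBuffers, useBuffers, freeBuffers and runPersistenceDemo, as well as the launch sizes, are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 32

// One heap pointer per block, kept in global memory between launches.
__device__ int* blockBuffers[NUM_BLOCKS];

__global__ void allocBuffers()
{
    // Only the first thread of each block allocates the block's buffer.
    if (threadIdx.x == 0)
        blockBuffers[blockIdx.x] =
            static_cast<int*>(malloc(blockDim.x * sizeof(int)));
}

__global__ void useBuffers()
{
    // Threads of a later launch use memory allocated by the earlier one.
    int* buf = blockBuffers[blockIdx.x];
    if (buf != nullptr)
        buf[threadIdx.x] = threadIdx.x;
}

__global__ void freeBuffers()
{
    // The memory stays allocated until it is explicitly freed.
    if (threadIdx.x == 0 && blockBuffers[blockIdx.x] != nullptr)
        free(blockBuffers[blockIdx.x]);
}

// Returns 0 on success, 1 on any CUDA error.
int runPersistenceDemo()
{
    // Launches on the default stream execute in order.
    allocBuffers<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();
    useBuffers<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();
    freeBuffers<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    return runPersistenceDemo();
}
```

Pointers returned by the in-kernel malloc() must be released with the in-kernel free(); they cannot be passed to cudaFree on the host.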
Related posts on stack: ... stack frame for kernels, ... local memory per cuda thread
Related posts on heap: ... heap memory ..., ... heap memory limitations per thread