4

As in the title, can someone explain the heap and stack in CUDA in more detail? Are they any different from the usual heap and stack in CPU memory?

I ran into a problem when increasing the stack size in CUDA; there seems to be a limit, because when I set the stack size above 1024*300 bytes (Tesla M2090) with cudaDeviceSetLimit, I get an error: invalid argument.

Another question: when I set the heap size to a very large number (about 2 GB) to allocate my R-tree (data structure) with 2000 elements, I get a runtime error: too many resources requested for launch

Any idea?

P.S.: I launch with only a single thread (kernel<<<1,1>>>)
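For reference, this is roughly what the relevant calls look like (a minimal sketch; the kernel body is omitted and the sizes are the ones mentioned above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() { /* builds the R-tree on the device */ }

int main() {
    // Setting the per-thread stack above ~1024*300 bytes fails with "invalid argument"
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 1024 * 300);
    printf("stack limit: %s\n", cudaGetErrorString(err));

    // Requesting a ~2 GB device heap for the R-tree allocations
    err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 2ULL * 1024 * 1024 * 1024);
    printf("heap limit:  %s\n", cudaGetErrorString(err));

    kernel<<<1, 1>>>();
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    return 0;
}
```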

Hoang Thong
  • 95
  • 1
  • 10
  • the "too many resources requested for launch" error can be related to the number of registers and threads you are using in your kernel. Try printing the usage by adding -Xptxas="-v" to your compile line. – terence hill Jan 14 '16 at 16:29
  • 1
    Why do you need to increase the stack/heap size? Please post an example that reproduces the problem. – terence hill Jan 14 '16 at 16:31
  • see also this post http://stackoverflow.com/questions/13150618/understanding-gpu-heap-memory-and-resident-warps – terence hill Jan 14 '16 at 16:32
  • because when I debug with CUDA debugging, it reports an error like `detected data stack overflow`; it seems to be out of stack memory, so I increased the stack size and the problem was solved – Hoang Thong Jan 14 '16 at 16:34
  • 1
    questions seeking debugging help (why isn't this code working?) are expected to include an MCVE. And there is no exception for conceptual questions (why *might* my code, which I haven't shown, be failing?) If there were such an exception, then everyone could use it and the MCVE [requirement](http://stackoverflow.com/help/on-topic) would be pointless. – Robert Crovella Jan 14 '16 at 16:55
  • My code is an R-tree with about 1000 lines, how can I post it here? On the other hand, my question starts with the problem of stack and heap in CPU memory versus GPU memory; the rest is just to give my problem context. My problem is not about the debugging, it's about the GPU memory hierarchy. Any idea? – Hoang Thong Jan 14 '16 at 17:13
  • The stack is allocated per thread and, as you can see in my answer, has in fact a hardware limit. The heap, however, resides in global memory. This information is in the CUDA programming guide. However, it is difficult to give a reasonable answer without the context. I would say that you have to rethink your code. – terence hill Jan 14 '16 at 17:37
  • 1
    Nobody wants to see *your* code. I would suggest that you start by reading what an [MCVE](http://stackoverflow.com/help/mcve) is. For the two problems you describe in your second paragraph and your third paragraph, an MCVE would probably only need to be about 20 lines of code, maybe less. I don't think it takes 1000 lines of code to demonstrate an `invalid argument` on a call to `cudaDeviceSetLimit`. – Robert Crovella Jan 14 '16 at 20:15
  • Setting to 300 * 1024 bytes per thread would increase the stack size to > 300 * 1024 * 16 SMs * 48 warps/SM * 32 threads/warp = 7549747200 bytes. The local memory size is allocated based upon the maximum number of threads that can run on the GPU at one time. The launch configuration is not part of the allocation. – Greg Smith Jan 19 '16 at 02:09

2 Answers

7

About stack and heap

The stack is allocated per thread and has a hardware limit (see below). The heap resides in global memory, can be allocated using malloc() and must be explicitly freed using free() (CUDA doc).

These device functions:

void* malloc(size_t size);
void free(void* ptr);

can be useful, but I would recommend using them only when they are really needed. A better approach is to rethink the code so that the memory is allocated with the host-side functions (such as cudaMalloc).
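A minimal sketch of in-kernel allocation (illustrative only; the sizes and names are arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void device_alloc_demo() {
    // Allocates from the device heap; returns NULL if the heap is exhausted
    int* data = (int*)malloc(64 * sizeof(int));
    if (data == NULL) {
        printf("device malloc failed\n");
        return;
    }
    data[0] = 42;
    printf("data[0] = %d\n", data[0]);
    free(data);  // must be freed explicitly; the heap is not reclaimed automatically
}

int main() {
    device_alloc_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Note that pointers returned by device-side malloc() cannot be passed to host-side cudaFree(), and vice versa.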


The stack size has a hardware limit which can be computed (according to this answer by @njuffa) as the minimum of:

  • amount of local memory per thread
  • available GPU memory / number of SMs / maximum resident threads per SM

As you are increasing the size, and you are running only one thread, I guess your problem is the second limit, which in your case (Tesla M2090: 16 SMs, 48 resident warps/SM × 32 threads/warp = 1536 resident threads per SM) should be about: 6144 MB / 16 / 1536 ≈ 250 KB.


The heap has a fixed size (8 MB by default) that must be set, before any kernel that calls malloc() is launched, using the function cudaDeviceSetLimit. Be aware that the memory actually reserved will be at least the size requested, due to some allocation overhead. Also worth mentioning: the heap limit is not per-thread; allocations have the lifetime of the CUDA context (until released by a call to free()) and can be used by threads of subsequent kernel launches.
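For example, to enlarge the heap before the first kernel that calls malloc() (the size here is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Must be called before any kernel that uses device-side malloc()
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);  // 128 MB

    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("device heap: %zu bytes\n", heap);  // may be rounded up by the runtime
    return 0;
}
```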

Related posts on stack: ... stack frame for kernels, ... local memory per cuda thread

Related posts on heap: ... heap memory ..., ... heap memory limitations per thread

terence hill
  • 3,354
  • 18
  • 31
4

Stack and heap are different things. The stack is the per-thread stack; the heap is the per-context runtime heap that device-side malloc/new uses to allocate memory. You set the stack size with the cudaLimitStackSize flag and the runtime heap size with the cudaLimitMallocHeapSize flag, both passed to the cudaDeviceSetLimit API.

It sounds like you want to increase the heap size, but are trying to do so by changing the stack size. On the other hand, if you need a large stack size, you may have to reduce the number of threads per block you use in order to avoid kernel launch failures.
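Assuming that is the case, a sketch like this sets each limit under its correct flag and checks the return status (the values and the helper name are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report failures from cudaDeviceSetLimit
static void set_limit(cudaLimit limit, size_t value, const char* name) {
    cudaError_t err = cudaDeviceSetLimit(limit, value);
    if (err != cudaSuccess)
        fprintf(stderr, "%s: %s\n", name, cudaGetErrorString(err));
}

int main() {
    set_limit(cudaLimitStackSize, 64 * 1024, "stack");             // per-thread stack
    set_limit(cudaLimitMallocHeapSize, 256 * 1024 * 1024, "heap"); // per-context heap
    return 0;
}
```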

talonmies
  • 70,661
  • 34
  • 192
  • 269
  • "if you need a large stack size, you may have to reduce the number of threads per block". I thought that the stack size is a hardware limit, also confirmed by this post by njuffa https://devtalk.nvidia.com/default/topic/642743/what-is-the-maximum-cuda-stack-frame-size-per-kerenl-/. What am I missing? – terence hill Jan 14 '16 at 16:58
  • no, I increased both of them to a very large size (300 MB for stack and 2 GB for heap). My code is an R-tree structure with more than 1000 lines, so I can't post it here. All I want is to port my R-tree so it stays in GPU memory, but even when I increase the limits, the out-of-stack-size error still occurs – Hoang Thong Jan 14 '16 at 16:59
  • I was asking: how can reducing the number of running threads help if there is a physical limitation on the stack size? If you surpass the physical size (as I think is happening in this question), then the launch will fail even with one thread. – terence hill Jan 14 '16 at 17:10
  • Yes, I probably misinterpreted because I was already thinking of the one-thread case, thanks for the clarification. – terence hill Jan 14 '16 at 17:16