The kernel launch (or the issuing of warps or blocks) will not be limited by the heap size. Instead, the in-kernel malloc calls will fail (returning NULL) once the outstanding demand — the number of threads that have reached the per-thread malloc but not the corresponding free, times the requested allocation per thread — can no longer be satisfied by the heap. You may wish to refer to the heap memory allocation section of the CUDA C Programming Guide. A per-thread allocation sample code is given in that section, and you can easily modify that code to prove this behavior to yourself: simply adjust the heap size and the number of threads (or blocks) launched to see the behavior when the heap limit is reached.

And yes, cudaLimitMallocHeapSize actually applies to the whole device context, so it affects all kernel launches that come after the relevant call to cudaDeviceSetLimit(). It is not a per-thread limit. Also note that there is some allocation overhead: setting a heap size of 128MB does not mean that all 128MB will be available for subsequent device malloc operations. It may also be useful to mention that device-side malloc is only possible on compute capability 2.0 and above.
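Here is a minimal sketch of such an experiment (the kernel name, sizes, and the use of managed memory for the counter are my own choices for illustration): a heap smaller than the total demand is configured, each thread attempts one allocation, and the number of NULL returns is counted. The kernel launch itself succeeds regardless.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread tries to allocate bytesPerThread from the device heap.
// When the heap is exhausted, malloc returns NULL; the kernel still
// launches and runs to completion.
__global__ void allocTest(size_t bytesPerThread, int *failCount)
{
    char *p = (char *)malloc(bytesPerThread);
    if (p == NULL) {
        atomicAdd(failCount, 1);   // record an exhausted-heap allocation
    } else {
        free(p);                   // release so a later launch can reuse the heap
    }
}

int main()
{
    // Illustrative sizes: an 8 MB heap, with each thread requesting 1 MB.
    const size_t heapSize = 8 * 1048576;
    const size_t bytesPerThread = 1048576;

    // Must be set before the heap is first used in this context.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapSize);

    int *failCount;
    cudaMallocManaged(&failCount, sizeof(int));
    *failCount = 0;

    // 16 threads request 16 MB in total against an 8 MB heap, so some
    // allocations must fail (and allocation overhead means fewer than
    // 8 full megabytes may actually be available).
    allocTest<<<1, 16>>>(bytesPerThread, failCount);
    cudaDeviceSynchronize();

    printf("failed allocations: %d\n", *failCount);
    cudaFree(failCount);
    return 0;
}
```

Raising the heap size with cudaDeviceSetLimit, or lowering the thread count, should drive the failure count toward zero.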