The kernel launch (or the issuing of warps or blocks) will not be limited by the heap size. Instead, the in-kernel malloc calls will fail (returning NULL) once the outstanding demand — the number of threads that have reached the per-thread malloc but not the corresponding free, times the requested allocation per thread — can no longer be satisfied by the heap. You may wish to refer to the heap memory allocation section of the CUDA C Programming Guide. A per-thread allocation sample code is given in that section, and you can easily modify that code to prove this behavior to yourself: simply adjust the heap size and the number of threads (or blocks) launched to see the behavior when the heap limit is reached.

And yes, cudaLimitMallocHeapSize actually applies to the whole device context, so it affects all kernel launches that come after the relevant call to cudaDeviceSetLimit(). It is not a per-thread limit. Also note that there is some allocation overhead: setting a heap size of 128MB does not mean that all 128MB will be available for subsequent device malloc operations. It may also be useful to mention that device-side malloc is only possible on compute capability 2.0 and above.
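Here is a minimal sketch of such an experiment (the kernel name, sizes, and the use of managed memory for the counter are my own choices for illustration): a heap smaller than the total demand is configured, each thread attempts one allocation, and the number of NULL returns is counted. The kernel launch itself succeeds regardless.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread tries to allocate bytesPerThread from the device heap.
// When the heap is exhausted, malloc returns NULL; the kernel still
// launches and runs to completion.
__global__ void allocTest(size_t bytesPerThread, int *failCount)
{
    char *p = (char *)malloc(bytesPerThread);
    if (p == NULL) {
        atomicAdd(failCount, 1);   // record an exhausted-heap allocation
    } else {
        free(p);                   // release so a later launch can reuse the heap
    }
}

int main()
{
    // Illustrative sizes: an 8 MB heap, with each thread requesting 1 MB.
    const size_t heapSize = 8 * 1048576;
    const size_t bytesPerThread = 1048576;

    // Must be set before the heap is first used in this context.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapSize);

    int *failCount;
    cudaMallocManaged(&failCount, sizeof(int));
    *failCount = 0;

    // 16 threads request 16 MB in total against an 8 MB heap, so some
    // allocations must fail (and allocation overhead means fewer than
    // 8 full megabytes may actually be available).
    allocTest<<<1, 16>>>(bytesPerThread, failCount);
    cudaDeviceSynchronize();

    printf("failed allocations: %d\n", *failCount);
    cudaFree(failCount);
    return 0;
}
```

Raising the heap size with cudaDeviceSetLimit, or lowering the thread count, should drive the failure count toward zero.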