
When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation.) In trying to understand this problem, I realized I have a couple of questions related to CUDA memory management.

  1. Is there a virtual memory concept in CUDA?

  2. If only one kernel is allowed to run on CUDA at a time, after it terminates, will all of the memory it used or allocated be released? If not, when does that memory get released?

  3. If more than one kernel is allowed to run on CUDA, how can they make sure that the memory they use does not overlap?

Can anyone help me answer these questions? Thanks

Edit 1: Operating system: x86_64 GNU/Linux; CUDA version: 4.0; Device: GeForce 200. It is one of the GPUs attached to the machine, and I don't think it is a display device.

Edit 2: The following is what I got after doing some research. Feel free to correct me.

  1. CUDA will create one context for each host thread. This context keeps information such as what portion of memory (pre-allocated memory or dynamically allocated memory) has been reserved for this application, so that other applications cannot write to it. When this application terminates (not the kernel), this portion of memory will be released.

  2. CUDA memory is maintained by a linked list. When an application needs to allocate memory, it walks this linked list to see if there is a contiguous memory chunk available for allocation. If it fails to find such a chunk, an "out of memory" error is reported to the user even though the total available memory is greater than the requested amount. That is the problem related to memory fragmentation (see the sketch after this list).

  3. cuMemGetInfo will tell you how much memory is free, but not necessarily how much memory you can obtain in a single largest allocation, due to memory fragmentation.

  4. On the Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory, and WDDM will manage swapping data back to main memory.
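To make points 2 and 3 concrete, here is a small host-side sketch I put together (not taken from any documentation; the chunk size is arbitrary, and whether the final allocation actually fails depends on the driver and GPU):

    // fragment_demo.cu -- illustrative only; sizes and names are arbitrary.
    // Build with e.g.: nvcc fragment_demo.cu -o fragment_demo
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    static void report(const char *tag)
    {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);   // runtime-API analogue of cuMemGetInfo
        printf("%-20s free = %zu MB of %zu MB\n", tag, free_bytes >> 20, total_bytes >> 20);
    }

    int main()
    {
        report("context created:");

        // Grab most of the card in 32 MB pieces.
        const size_t chunk = 32u << 20;
        std::vector<void *> blocks;
        void *p = 0;
        while (cudaMalloc(&p, chunk) == cudaSuccess)
            blocks.push_back(p);

        // Free every second piece: plenty of memory is free in total,
        // but it is scattered in 32 MB holes.
        for (size_t i = 0; i < blocks.size(); i += 2) {
            cudaFree(blocks[i]);
            blocks[i] = 0;
        }
        report("after partial free:");

        // A single request larger than any one hole may now fail with
        // cudaErrorMemoryAllocation even though cudaMemGetInfo reports
        // more than enough free memory in total.
        void *big = 0;
        cudaError_t err = cudaMalloc(&big, 4 * chunk);
        printf("large cudaMalloc: %s\n", cudaGetErrorString(err));

        cudaFree(big);
        for (size_t i = 0; i < blocks.size(); ++i)
            cudaFree(blocks[i]);
        return 0;
    }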

New questions:

  1. If the memory reserved in the context is fully released after the application has terminated, memory fragmentation should not exist. There must be some kind of data left in the memory.

  2. Is there any way to restructure the GPU memory?

xhe8
  • Can you edit the question to include which operating system, GPU and CUDA version you are using, and whether the GPU is a display or non-display device? It will have a bearing on the correct answer to your question. – talonmies Dec 31 '11 at 02:04
  • To answer the extra questions - user observable fragmentation occurs *within a context*, and no, there is no way to change the memory mapping within the GPU; that is all handled by the host driver. – talonmies Jan 01 '12 at 10:11
  • As you explain, a context allocation is composed of the context static allocation, context user allocation and CUDA context runtime heap. I think the size of the context static allocation and context user allocation is decided in advance. Therefore, I think the only cause of memory fragmentation is the context runtime heap, which exists only on the Fermi architecture. Is that correct? I guess the system will pre-allocate a chunk of memory for the context runtime heap so that in-kernel dynamic memory allocation is enabled. – xhe8 Jan 02 '12 at 03:11
  • Your question is currently kind of a mess. Can you edit it to just have the initial background, then a bunch of questions? – einpoklum Dec 02 '13 at 07:04

2 Answers

28

The device memory available to your code at runtime is basically calculated as

Free memory =   total memory 
              - display driver reservations 
              - CUDA driver reservations
              - CUDA context static allocations (local memory, constant memory, device code)
              - CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
              - CUDA context user allocations (global memory, textures)

If you are getting an out of memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory on the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in a context will consume for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocessor on the device. This can run into hundreds of MB of memory if a local memory heavy kernel is loaded on a device with a lot of multiprocessors.
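To put a rough, purely illustrative number on that: a kernel that spills 4 KB of local memory per thread, on a device with 30 multiprocessors that can each keep 1024 threads resident, forces a reservation of about 4 KB × 1024 × 30 ≈ 120 MB before the first user cudaMalloc ever runs.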

The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run your problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call; that will then give you the amount of memory your context is using. That might let you get a handle on where the memory is going. It is very unlikely that fragmentation is the problem if you are getting a failure on the first cudaMalloc call.
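A minimal version of that diagnostic could look something like this (my own sketch, not code from the answer; the file name and output format are arbitrary):

    // context_overhead.cu -- minimal sketch of the diagnostic described above.
    // Build with e.g.: nvcc context_overhead.cu -o context_overhead
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Any runtime API call creates the CUDA context; cudaFree(0) is the
        // usual idiom for forcing that to happen explicitly. Since this file
        // contains no kernels, the context carries no static allocations
        // from user device code.
        cudaFree(0);

        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        printf("total: %zu MB, free with a minimal context: %zu MB\n",
               total_bytes >> 20, free_bytes >> 20);
        return 0;
    }

Putting the same cudaMemGetInfo/printf pair immediately before the first cudaMalloc in the problematic application and comparing the two free values should give a rough idea of how much the full context (including its local memory reservation) is costing.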

talonmies
  • talonmies, thanks for your information. It is very helpful. One more question: is it possible that multiple contexts exist in device memory? – xhe8 Dec 31 '11 at 06:10
  • Yes, it is possible, but a given thread can only ever hold a single context on a given device. The usual scenario would be two processes trying to run on the same GPU at the same time, or a multithreaded app opening two contexts with two threads. The latter is much harder to do in CUDA 4 than it used to be. – talonmies Dec 31 '11 at 06:21
  • Then what mechanism is used to allocate memory for multiple contexts? How can the system make sure different contexts will be allocated different portions of memory? – xhe8 Dec 31 '11 at 06:33
  • The memory allocated using cudaMalloc belongs to "CUDA context static allocations", correct? – xhe8 Dec 31 '11 at 06:36
  • No, context user allocations. Static allocations are those things which are compiled into the context (local memory, constant memory, static symbols, device code). Different contexts are managed by the CUDA host driver (and WDDM on Vista/Win7). Each CUDA context gets its own virtual address space and the driver maintains the separation. Memory and pointers are non-portable between contexts (except when using the Fermi-only unified address space model). You will have to trust that the driver works (and it does..) – talonmies Dec 31 '11 at 06:54
  • Hi @talonmies. I have a question about memory segmentation: "constant, global, local memory and the runtime heap are all parts of the 6 pieces of DRAM, and there is no difference in appearance, as the Fermi whitepaper shows. It is the host driver that makes them have different functions" - is that correct? Thanks in advance! – biubiuty Nov 04 '12 at 01:43
4
  1. GPU off-chip memory is separated into global, local and constant memory. These three memory types are a virtual memory concept. Global memory is free for all threads, local memory is for one thread only (mostly used for register spilling) and constant memory is cached global memory (writable only from host code). Have a look at section 5.3.2 of the CUDA C Programming Guide.

  2. EDIT: removed

  3. Memory allocated via cudaMalloc never overlaps. For the memory a kernel allocates during runtime, there should be enough memory available. If you are out of memory and try to start a kernel (only a guess from me) you should get the "unknown error" error message. The driver was then unable to start and/or execute the kernel (a way to check for that error is sketched below).
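As an aside (my own sketch; the kernel is just a placeholder), one way to surface that kind of launch error instead of missing it silently is:

    // launch_check.cu -- minimal error-checking sketch; the kernel is a stand-in.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy() { }

    int main()
    {
        dummy<<<1, 1>>>();

        // Errors that prevent the launch (e.g. out of resources) show up here.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));

        // Errors that happen while the kernel runs show up after a sync.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            printf("execution failed: %s\n", cudaGetErrorString(err));

        return 0;
    }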

Michael Haidl
  • Thank you for your reply. But I think I want a more low-level explanation. I learned from other posts that CUDA memory management has something to do with contexts and some data structures, but I want a more detailed explanation so that I can figure out the memory problem in my program. – xhe8 Dec 31 '11 at 03:03
  • Your second answer is mostly wrong. Kernel-scope memory is *pre-allocated* at the time the context is established on a device. The contents of local memory are only valid for the duration of a kernel run, but the memory itself is reserved when a context is established. Dynamic memory is allocated from a runtime heap which is also reserved at context establishment time, and it remains accessible and valid for the life of the context, not the kernel. There is an API call for manipulating the heap size at runtime from the default size, if required. – talonmies Dec 31 '11 at 04:22