
I was profiling my CUDA 4 program and it turned out that at some stage the running process used over 80 GiB of virtual memory. That was a lot more than I would have expected. After examining the evolution of the memory map over time and comparing it to the line of code being executed, it turned out that the virtual memory usage jumped to over 80 GiB right after these simple instructions:

  #include <cstdio>
  #include <cuda_runtime.h>

  int deviceCount = 0;
  cudaError_t err = cudaGetDeviceCount(&deviceCount);
  if (err != cudaSuccess || deviceCount == 0) {
    // perror() would print an unrelated errno message here;
    // CUDA errors are not reported through errno.
    fprintf(stderr, "No devices supporting CUDA\n");
  }

Clearly, this is the first CUDA call, so this is where the runtime gets initialized. Afterwards the memory map looks like this (truncated):

Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000   89796   14716       0 r-x--  prg
0000000005db1000      12      12       8 rw---  prg
0000000005db4000      80      76      76 rw---    [ anon ]
0000000007343000   39192   37492   37492 rw---    [ anon ]
0000000200000000    4608       0       0 -----    [ anon ]
0000000200480000    1536    1536    1536 rw---    [ anon ]
0000000200600000 83879936       0       0 -----    [ anon ]

Note the huge area now mapped into the virtual address space.

Okay, maybe it's not a big problem, since reserving memory in Linux doesn't cost much unless you actually write to it. But it's really annoying because, for example, MPI jobs have to be submitted with the maximum amount of vmem the job may use. And 80 GiB is then just a lower bound for CUDA jobs - one has to add everything else on top.

I can imagine that it has to do with the so-called scratch space that CUDA maintains - a kind of memory pool for kernel code that can dynamically grow and shrink. But that's speculation, and scratch space is supposed to be allocated in device memory anyway.

Any insights?

ritter

1 Answer


Nothing to do with scratch space; it is the result of the addressing system that allows unified addressing and peer-to-peer access between the host and multiple GPUs. The CUDA driver registers all of the GPUs' memory plus the host memory in a single virtual address space using the kernel's virtual memory system. It isn't actually memory consumption per se; it is just a "trick" to map all the available address spaces into a single linear virtual space for unified addressing.
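If you want to confirm that unified virtual addressing (UVA) is active on your devices, the runtime exposes it through the device properties. A minimal sketch, assuming CUDA 4.0+ and compilation with nvcc (not tied to the asker's program):

```cuda
// Sketch, assuming CUDA 4.0+: query whether each device participates
// in the single unified virtual address space described above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) return 1;
    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        // unifiedAddressing == 1 means this device's memory is mapped
        // into the same virtual address space as host memory, which is
        // what inflates the process's virtual size at context creation.
        printf("device %d (%s): unifiedAddressing = %d\n",
               d, prop.name, prop.unifiedAddressing);
    }
    return 0;
}
```

On platforms where UVA is not supported (32-bit hosts, or pre-Fermi GPUs), `unifiedAddressing` reports 0 and you won't see the giant mapping.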

talonmies
  • Ok, makes sense. So, in my case the host has 48 GB RAM installed + 8 GB swap and 4 GPUs with 6 GB each, making 80 GB in total. Bingo! Does this mean if I allocate on the normal heap, this would also appear inside these 80 GB? – ritter Jul 24 '12 at 13:43
  • Heap is just an allocation from each GPU's memory for each context. You won't see a change in virtual memory because of heap, but the free memory the API reports on a given GPU will decrease. – talonmies Jul 24 '12 at 14:03
  • @Frank: Can I have your computer? :) – Roger Dahl Jul 24 '12 at 16:39
  • So if Frank's computer has 48GB of RAM and CUDA demands 80GB of memory for its "trick", are you saying the swap file will balloon to 32GB? Where exactly is this 80GB allocated? – mchen Dec 31 '12 at 15:12
  • @MiloChen: Nowhere. The process simply maps all of the GPU(s) memory and all the system memory into a single *virtual* address space which the process can uniformly address. – talonmies Dec 31 '12 at 15:29
  • I'm trying to understand how it works in my case: I have 16G RAM, 4G swap and two GPUs with 6G and 1.5G memory. How comes my process needs 45G virtual memory? There is a single anonymous memory block of ~43G in `smaps`. – Pavel Jun 27 '14 at 17:05
  • how to disable it? In my app CUDA requires 50Gb paging file – iperov Jul 24 '21 at 05:31