
Starting with zero usage:

>>> import gc
>>> import GPUtil
>>> import torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
|  3 |  0% |  0% |

Then I create a big enough tensor and hog the memory:

>>> x = torch.rand(10000,300,200).cuda()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
|  0 |  0% | 26% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
|  3 |  0% |  0% |

Then I tried several ways to see if the tensor's memory gets freed.

Attempt 1: Detach, send to CPU and overwrite the variable

No, doesn't work.

>>> x = x.detach().cpu()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
|  0 |  0% | 26% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
|  3 |  0% |  0% |

Attempt 2: Delete the variable

No, this doesn't work either.

>>> del x
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
|  0 |  0% | 26% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
|  3 |  0% |  0% |

Attempt 3: Use the torch.cuda.empty_cache() function

Seems to work, but there's still some lingering overhead...

>>> torch.cuda.empty_cache()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
|  0 |  0% |  5% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
|  3 |  0% |  0% |

Attempt 4: Maybe run the garbage collector.

No, 5% is still being hogged.

>>> gc.collect()
0
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
|  0 |  0% |  5% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
|  3 |  0% |  0% |

Attempt 5: Try deleting torch altogether (as if that would work when del x didn't work -_- )

No, it doesn't...

>>> del torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
|  0 |  0% |  5% |
|  1 |  0% |  0% |
|  2 |  0% |  0% |
|  3 |  0% |  0% |

And then I checked gc.get_objects(), and it looks like there's still quite a lot of odd THCTensor stuff in there...
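
The check was roughly just a loop over gc.get_objects() looking for tensor objects, something along these lines (the filter is approximate):

import gc
import torch

for obj in gc.get_objects():
    try:
        # print any tensor objects the garbage collector still tracks
        if torch.is_tensor(obj):
            print(type(obj), obj.size(), obj.is_cuda)
    except Exception:
        pass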

Any idea why the memory is still in use after clearing the cache?

alvas
  • First, confirm which process is using the GPU memory using `nvidia-smi`. There the process id `pid` can be used to find the process. If no processes are shown but GPU memory is still being used, you can try [this method](https://devtalk.nvidia.com/default/topic/958159/cuda-programming-and-performance/11-gb-of-gpu-ram-used-and-no-process-listed-by-nvidia-smi/) to clear the memory. – akshayk07 Aug 14 '19 at 14:31

3 Answers


It looks like PyTorch's caching allocator reserves some fixed amount of memory even if there are no tensors, and this allocation is triggered by the first CUDA memory access (torch.cuda.empty_cache() frees unused tensors from the cache, but the cache itself still uses some memory).

Even with a tiny 1-element tensor, after del and torch.cuda.empty_cache(), GPUtil.showUtilization(all=True) reports exactly the same amount of GPU memory used as for a huge tensor (and both torch.cuda.memory_cached() and torch.cuda.memory_allocated() return zero).
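
A minimal sketch of that check (exact numbers will depend on the GPU, driver and PyTorch version; memory_cached() is called memory_reserved() in newer releases):

import torch
import GPUtil

x = torch.ones(1).cuda()              # 1-element tensor; the first CUDA access
                                      # initializes the context and the allocator
GPUtil.showUtilization(all=True)      # already shows the baseline memory usage

del x
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # 0
print(torch.cuda.memory_cached())     # 0
GPUtil.showUtilization(all=True)      # the baseline usage is still there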

Sergii Dymchenko

From the PyTorch docs:

Memory management

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. **However, the unused memory managed by the allocator will still show as if used in nvidia-smi.** You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_cached() and max_memory_cached() to monitor memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that those can be used by other GPU applications. However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.

I bolded a part mentioning nvidia-smi, which as far as I know is used by GPUtil.
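
As a quick sketch of which number each call reports (using the same big tensor as in the question):

import torch

x = torch.rand(10000, 300, 200).cuda()

print(torch.cuda.memory_allocated())  # bytes occupied by live tensors
print(torch.cuda.memory_cached())     # bytes held by the caching allocator;
                                      # nvidia-smi shows at least this much

del x                                 # the tensor's memory goes back to the cache
torch.cuda.empty_cache()              # the unused cache is handed back to the driver
print(torch.cuda.memory_cached())     # now 0, yet nvidia-smi still shows the
                                      # CUDA context overhead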

Stanowczo
  • Torch might also allocate memory for things like an internal system / manager. Probably its allocator. – Tiphaine Aug 20 '19 at 16:03
  • What about the sentence about `empty_cache()`? To me it clearly sounds like it really frees memory, i.e. that the memory also shows up as free in nvidia-smi. – BlameTheBits Aug 21 '19 at 10:14
  • Indeed, but unfortunately it is also stated that it frees only memory considered by pytorch as unused... This might be this lingering 5%. – Stanowczo Aug 21 '19 at 10:20
  • Yes. There must be something internal as @Tiphaine suggested too. But these 5% of OP are about half a Gigabyte! And then again the doc only knows "unused cache memory" and "occupied GPU memory by tensors" which these 5% seem to neither belong to. – BlameTheBits Aug 21 '19 at 10:28
  • Also note that cuda and torch allocate some memory for kernels, so even if you make a very tiny tensor on gpu, it can still take up to 1.5GB GPU memory: https://github.com/pytorch/pytorch/issues/12873#issuecomment-482916237 This memory is not considered as allocated or reserved by pytorch's allocator – chiragjn Dec 17 '21 at 11:12
  • `memory_cached()` and `max_memory_cached()` are now called `memory_reserved()` and `max_memory_reserved()`. – Maikefer Apr 24 '23 at 11:38

Thanks for sharing this! I am running into the same problem and I used your example to debug. Basically, my findings are:

  • gc.collect() and torch.cuda.empty_cache() only free memory after the variables are deleted
  • del var + empty_cache() frees both cached and allocated memory
  • del var + gc.collect() frees only the allocated memory, not the cache
  • either way, there's still some overhead memory usage visible in nvidia-smi

Here's some code to reproduce the experiment:

import gc
import torch

def _get_less_used_gpu():
    """Print current/peak allocated and cached memory for every GPU and
    return the id of the device with the least allocated memory."""
    from torch import cuda
    cur_allocated_mem = {}
    cur_cached_mem = {}
    max_allocated_mem = {}
    max_cached_mem = {}
    for i in range(cuda.device_count()):
        cur_allocated_mem[i] = cuda.memory_allocated(i)    # memory occupied by live tensors
        cur_cached_mem[i] = cuda.memory_reserved(i)        # memory held by the caching allocator
        max_allocated_mem[i] = cuda.max_memory_allocated(i)
        max_cached_mem[i] = cuda.max_memory_reserved(i)
    print(cur_allocated_mem)
    print(cur_cached_mem)
    print(max_allocated_mem)
    print(max_cached_mem)
    min_all = min(cur_allocated_mem, key=cur_allocated_mem.get)
    print(min_all)
    return min_all

x = torch.rand(10000,300,200, device=0)

# see memory usage
_get_less_used_gpu()
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB

# try delete with empty_cache()
torch.cuda.empty_cache()
_get_less_used_gpu()
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB

# try delete with gc.collect()
gc.collect()
_get_less_used_gpu()
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB

# try del + gc.collect()
del x 
gc.collect()
_get_less_used_gpu()
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB

# try empty_cache() after deleting 
torch.cuda.empty_cache()
_get_less_used_gpu()
>{0: 0, 1: 0, 2: 0, 3: 0}
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: **1126MiB**

# re-create obj and try del + empty_cache()
x = torch.rand(10000,300,200, device=0)
del x
torch.cuda.empty_cache()
_get_less_used_gpu()
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: **1126MiB**

Nonetheless, this approach only applies when one knows exactly which variables are holding memory...which is not always the case when training deep learning models, especially with third-party libraries.
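
In code, the pattern boils down to something like this (just a sketch; cleanup_gpu is an illustrative name):

import gc
import torch

def cleanup_gpu():
    # only helps once the last Python references to the tensors are gone
    gc.collect()
    torch.cuda.empty_cache()

x = torch.rand(10000, 300, 200, device=0)
# ... use x ...
del x            # drop the reference first, otherwise nothing can be freed
cleanup_gpu()    # both allocated and cached memory drop back to 0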

Luca Clissa