
I'm getting a CUDA out of memory error. The error is shown below:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 23.65 GiB total capacity; 21.65 GiB already allocated; 242.88 MiB free; 22.55 GiB reserved in total by PyTorch)

I'm able to train the model after I reduce the batch_size. When I check the output of nvidia-smi, I see that about 40% of the memory is still free. Here is the output: [screenshot of nvidia-smi output]

What could be the possible reason for this?
pytorch: 1.4
cuda: 10.2
input_size: (512, 512, 4)
using half-precision

More information: the plot of GPU memory utilization is shown below. [plot of GPU memory utilization over time]

The numbers at each peak represent the batch_size. It seems the initial memory requirement is much higher than the memory needed afterward. Can someone explain?
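
In case it helps to reproduce the measurements, here is a minimal sketch that logs PyTorch's own allocator statistics after every step instead of relying on a one-off nvidia-smi reading. The tiny conv net, random data, batch size 8, and SGD optimizer are stand-ins for the real model and training loop, not the actual code behind the question:

```python
import torch
import torch.nn as nn

def log_gpu_memory(tag):
    # memory_allocated:     bytes currently held by live tensors.
    # memory_reserved:      bytes held by PyTorch's caching allocator; this,
    #                       plus the CUDA context, is roughly what nvidia-smi
    #                       reports for the process. (Called memory_cached()
    #                       before PyTorch 1.4.)
    # max_memory_allocated: peak tensor usage so far -- the number that has to
    #                       fit on the GPU, not the steady-state value.
    mib = 2 ** 20
    print(f"{tag}: allocated={torch.cuda.memory_allocated() // mib} MiB, "
          f"reserved={torch.cuda.memory_reserved() // mib} MiB, "
          f"peak={torch.cuda.max_memory_allocated() // mib} MiB")

# Toy stand-in for the real network; input shape follows the question:
# 4-channel 512x512 images in half precision.
model = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),
).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(3):
    x = torch.randn(8, 4, 512, 512, device="cuda", dtype=torch.half)
    target = torch.randn(8, 1, 512, 512, device="cuda", dtype=torch.half)
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    log_gpu_memory(f"step {step}")
```

If the reserved figure stays well above the allocated figure after the first step, that is just the caching allocator holding on to freed blocks; nvidia-smi counts that cached memory (plus the CUDA context) as used.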

  • It seems to me that there is 22.55 GiB reserved in total by PyTorch - 21.65 GiB already allocated = 242.88 MiB free. It then tries to allocate 256.00 MiB - as in a new allocation, not using the already-allocated memory, and as 256>242, it fails. – Andrew Morton Apr 23 '20 at 08:37
  • I understand that, but why is `nvidia-smi` showing GPU memory usage close to 13.5 GB? – pauli Apr 23 '20 at 08:40
  • [What's the difference between nvidia-smi Memory-Usage and GPU Memory Usage?](https://stackoverflow.com/a/55013754/1115360) says "[the] difference, it is due to GPU memory consumption not associated with that process (e.g. in use by CUDA itself)." – Andrew Morton Apr 23 '20 at 08:57
  • It does not make much sense to use nvidia-smi to see how much RAM is being used by your program, as it shows usage at the moment you execute it, and RAM usage will vary as the program is executed. The memory required by the program is clearly more than what is available. – Dr. Snoopy Apr 23 '20 at 09:27
  • PyTorch almost certainly tried to allocate more memory than was available. In reality, the total GPU memory usage went all the way up to nearly 24 GB; then, because no more was available, PyTorch crashed, and the memory that process had allocated was freed. This happens pretty fast, which is probably why you didn't see it in nvidia-smi, which only reports the state of the GPU when queried. – jodag Apr 23 '20 at 10:01
  • Sorry if I didn't make it clear. The `nvidia-smi` output is after reducing the `batch_size` from 16 to 8 and running the model. At `batch_size=16` I was getting the CUDA out of memory error, but at `batch_size=8` 40% of the memory is unallocated. It is very unlikely that the model needs more than 10 GB of extra memory for 8 more images. – pauli Apr 23 '20 at 14:52
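
Regarding the comment thread above: since nvidia-smi only shows a point-in-time snapshot, one way to check for a short-lived spike at the start of training is to bracket the suspect region with PyTorch's peak-allocation counter. A sketch, where `train_one_epoch` is a hypothetical callable wrapping the real training loop:

```python
import torch

def measure_peak(train_one_epoch):
    # Reset the peak counter, run the region of interest, then read the peak
    # back. Unlike nvidia-smi, this captures the maximum even if the spike
    # lasted only a few milliseconds.
    torch.cuda.reset_max_memory_allocated()   # reset_peak_memory_stats() in newer releases
    train_one_epoch()
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated() / 2 ** 20
    reserved = torch.cuda.memory_reserved() / 2 ** 20
    print(f"peak allocated: {peak:.0f} MiB, "
          f"still reserved by the caching allocator: {reserved:.0f} MiB")

# Hypothetical usage -- run_epoch, model, loader and optimizer are placeholders:
# measure_peak(lambda: run_epoch(model, loader, optimizer))
```

If the peak for the first epoch comes out much higher than for later ones, one plausible (unconfirmed here) explanation is extra workspace used during the first forward/backward passes, for example when `torch.backends.cudnn.benchmark` is enabled and cuDNN tries several algorithms before settling on one.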
