
I have a list of sentences I'm trying to calculate perplexity for, using several models, with this code:

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np
model_name = 'cointegrated/rubert-tiny'
model = AutoModelForMaskedLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def score(model, tokenizer, sentence):
    # tokenize once, then repeat the sequence so that each row can mask one token
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    # diagonal mask that skips the special [CLS]/[SEP] positions
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    # only the masked positions contribute to the loss (-100 is ignored)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input.cuda(), labels=labels.cuda()).loss
    return np.exp(loss.item())


print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer)) 
# 4.541251105675365

Most models work well, but some sentences seem to throw an error:

RuntimeError: CUDA out of memory. Tried to allocate 10.34 GiB (GPU 0; 23.69 GiB total capacity; 10.97 GiB already allocated; 6.94 GiB free; 14.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

That makes sense, because some of the sentences are very long. So what I did was wrap the call in something like try / except RuntimeError: pass.

This seemed to work until around 210 sentences, and then it just output this error:

CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I found this, which has a lot of discussion and ideas; some were about potentially faulty GPUs. But I know my GPU works, since this exact code runs fine with other models. There's also talk about batch size here, which is why I thought it potentially relates to freeing up memory.

I tried running torch.cuda.empty_cache() to free the memory, like in here, after every few sentences, but it didn't work (it threw the same error).
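For reference, the loop ended up looking roughly like this (a minimal sketch; `sentences` is a stand-in for my actual list of strings):

results = []
for i, sent in enumerate(sentences):
    try:
        results.append(score(sentence=sent, model=model, tokenizer=tokenizer))
    except RuntimeError:
        # sentence too long for GPU memory -> skip it
        results.append(None)
    if i % 50 == 0:
        # try to release cached blocks back to the driver every so often
        torch.cuda.empty_cache()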

Update: I filtered out sentences with length over 550, and this seems to get rid of the CUDA error: an illegal memory access was encountered error.
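The filter itself was nothing fancy, roughly this (I'm measuring plain character length here as a stand-in; filtering on token count via the tokenizer would target the actual tensor size more directly):

filtered_sentences = [s for s in sentences if len(s) <= 550]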

Penguin
  • How do you test different models? Do you execute one program per model, or do you simply loop through them in a single program? – Luca Clissa Jan 05 '22 at 08:13
  • @LucaClissa I tried both methods to be honest. I had about 11 models to test, and 3 of them threw this error. For the remaining 8 I just ran them in a loop and they did just fine. – Penguin Jan 05 '22 at 16:31
  • I see, I'll try to summarize my experience with similar issues in an answer below – Luca Clissa Jan 06 '22 at 08:45

2 Answers


You need to apply gc.collect() before torch.cuda.empty_cache(). I also move the model to the CPU and then delete that model and its checkpoint. Try what works for you:

import gc

model.cpu()                # move the weights off the GPU first
del model, checkpoint      # drop the references (checkpoint = whatever object holds your loaded weights)
gc.collect()               # let Python actually reclaim the objects
torch.cuda.empty_cache()   # then release the cached GPU memory
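
If, as in the question, you loop over several models, a rough sketch of where this cleanup could go (model_names here is a hypothetical list of checkpoints):

import gc
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

for model_name in model_names:   # hypothetical list of model checkpoints to evaluate
    model = AutoModelForMaskedLM.from_pretrained(model_name).cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # ... score your sentences with this model ...
    # free everything before loading the next model
    model.cpu()
    del model, tokenizer
    gc.collect()
    torch.cuda.empty_cache()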
Abhi25t
  • A bit confused, when am I supposed to delete the model? Or `gc.collect()` for that matter? Should I do it every several sentences? – Penguin Dec 31 '21 at 15:21
  • ^ I think you should do it whenever an error is raised. – Umang Gupta Dec 31 '21 at 19:46
  • So deleting the model doesn't seem like an option while it's going through the sentences, because then I don't have a model to evaluate the remaining sentences. I tried running `gc.collect()` before `torch.cuda.empty_cache()` as you mentioned and it didn't seem to do anything (still got to about 210 sentences and had the same error) – Penguin Dec 31 '21 at 21:10
  • You should be able to infer millions of sentences on a single model, not just 210. Something seems wrong. I had to delete the model because I had to load a new model for different inferences. Mine is a different use case. – Abhi25t Jan 01 '22 at 04:36
  • `gc.collect()` also doesn't do anything for me – Alexis.Rolland Mar 15 '23 at 05:25
  • This worked for me without calling `model.cpu()` first, i.e. `del model; gc.collect(); torch.cuda.empty_cache()` – Matt Aug 25 '23 at 14:02

I don't have an exact answer, but I can share some troubleshooting techniques I adopted in similar situations... hope they may be helpful.

  1. First, CUDA errors are unfortunately vague sometimes, so you should consider running your code on the CPU to see if there is actually something else going on (see here).

  2. If the problem is about memory, here are two custom utils I use:

from torch import cuda


def get_less_used_gpu(gpus=None, debug=False):
    """Inspect cached/reserved and allocated memory on specified gpus and return the id of the less used device"""
    warn = None
    if gpus is None:
        warn = 'Falling back to default: all gpus'
        gpus = range(cuda.device_count())
    elif isinstance(gpus, str):
        gpus = [int(el) for el in gpus.split(',')]

    # check gpus arg VS available gpus
    sys_gpus = list(range(cuda.device_count()))
    if len(gpus) > len(sys_gpus):
        warn = f'WARNING: Specified {len(gpus)} gpus, but only {cuda.device_count()} available. Falling back to default: all gpus.\nIDs:\t{sys_gpus}'
        gpus = sys_gpus
    elif set(gpus).difference(sys_gpus):
        # take correctly specified and add as much bad specifications as unused system gpus
        available_gpus = set(gpus).intersection(sys_gpus)
        unavailable_gpus = set(gpus).difference(sys_gpus)
        unused_gpus = set(sys_gpus).difference(gpus)
        gpus = list(available_gpus) + list(unused_gpus)[:len(unavailable_gpus)]
        warn = f'GPU ids {unavailable_gpus} not available. Falling back to {len(gpus)} device(s).\nIDs:\t{list(gpus)}'

    cur_allocated_mem = {}
    cur_cached_mem = {}
    max_allocated_mem = {}
    max_cached_mem = {}
    for i in gpus:
        cur_allocated_mem[i] = cuda.memory_allocated(i)
        cur_cached_mem[i] = cuda.memory_reserved(i)
        max_allocated_mem[i] = cuda.max_memory_allocated(i)
        max_cached_mem[i] = cuda.max_memory_reserved(i)
    min_allocated = min(cur_allocated_mem, key=cur_allocated_mem.get)
    if debug:
        if warn:
            print(warn)
        print('Current allocated memory:', {f'cuda:{k}': v for k, v in cur_allocated_mem.items()})
        print('Current reserved memory:', {f'cuda:{k}': v for k, v in cur_cached_mem.items()})
        print('Maximum allocated memory:', {f'cuda:{k}': v for k, v in max_allocated_mem.items()})
        print('Maximum reserved memory:', {f'cuda:{k}': v for k, v in max_cached_mem.items()})
        print('Suggested GPU:', min_allocated)
    return min_allocated


def free_memory(to_delete: list, debug=False):
    import gc
    import inspect
    calling_namespace = inspect.currentframe().f_back
    if debug:
        print('Before:')
        get_less_used_gpu(debug=True)

    for _var in to_delete:
        calling_namespace.f_locals.pop(_var, None)
        gc.collect()
        cuda.empty_cache()
    if debug:
        print('After:')
        get_less_used_gpu(debug=True)

2.1 free_memory allows you to combine gc.collect and cuda.empty_cache, deleting some desired objects from the namespace and freeing their memory (you can pass a list of variable names as the to_delete argument). This is useful because you may have unused objects occupying memory. For example, imagine you loop through 3 models: the first one may still hold some GPU memory when you get to the second iteration (I don't know why, but I've experienced this in notebooks, and the only solution I could find was to either restart the notebook or explicitly free memory). However, I have to say that this is not always practical, as you need to know which variables are holding GPU memory... and you don't always know that, especially when you have a lot of gradients internally associated with the model (see here for more info). One thing you could also try is to use with torch.no_grad(): instead of with torch.inference_mode():; they should be equivalent but it may be worth a try...
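For example, a minimal usage sketch (reusing the checkpoint from the question; note that the names passed to free_memory are strings):

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('cointegrated/rubert-tiny').cuda()
# ... run whatever GPU work you need ...
free_memory(['model'], debug=True)  # pops `model` from the caller's namespace, then runs gc.collect() and cuda.empty_cache()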

2.2 In case you have a multi-GPU environment, you could consider alternately switching to the less used GPU thanks to the other util, get_less_used_gpu.
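Just a sketch of what that could look like (reusing AutoModelForMaskedLM and model_name from the question):

# load the next model onto whichever GPU currently has the least allocated memory
device_id = get_less_used_gpu(debug=True)
model = AutoModelForMaskedLM.from_pretrained(model_name).to(f'cuda:{device_id}')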

  3. Also, you can try to track GPU usage to see when the error happens and debug from there. The best/simplest way I can suggest is nvtop, if you are on a Linux platform.
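
If you want something you can log from inside the script as well, a rough sketch is to print PyTorch's own memory counters every few sentences and watch how they grow before the crash (`sentences` is again just a stand-in for whatever list you are scoring):

from torch import cuda

for i, sent in enumerate(sentences):
    # ... your scoring call goes here ...
    if i % 10 == 0:
        print(f'{i}: allocated={cuda.memory_allocated() / 1e9:.2f} GB, '
              f'reserved={cuda.memory_reserved() / 1e9:.2f} GB')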

Hope something turns out to be useful :)

Luca Clissa
  • Can you give an example of how you could call your free_memory function, e.g. after getting a 'CUDA out of memory' error? – slashdottir Sep 22 '22 at 20:01
  • Well, when you get CUDA OOM I'm afraid you can only restart the notebook/re-run your script. The idea behind `free_memory` is to free the GPU beforehand, so as to make sure you don't waste space on unnecessary objects held in memory. A typical usage for DL applications would be: 1. run your model, e.g. one config of hyperparams (or, in general, operations that require GPU usage); 2. free_memory; 3. run your second model (or other GPU operations you need) – Luca Clissa Sep 28 '22 at 09:05