
The issue

I am trying to run inference using a sentence-transformers model on all rows of the scientific_papers/pubmed dataset.

After 177 iterations of the code below, I get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB (GPU 0; 8.00 GiB total capacity; 4.92 GiB already allocated; 1.31 GiB free; 4.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The code

import transformers
import datasets
import torch
import nltk

dataset = datasets.load_dataset('scientific_papers', 'pubmed', split='train').shuffle(seed=1)

tokenizer = transformers.RobertaTokenizerFast.from_pretrained("sentence-transformers/all-distilroberta-v1")
model = transformers.AutoModel.from_pretrained("sentence-transformers/all-distilroberta-v1")
model.cuda()

def inference(document):
    # Split the document into sentences
    sentences = nltk.sent_tokenize(document)
    
    tokenized_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to('cuda')
    with torch.no_grad():
        model(**tokenized_sentences)

for e in range(len(dataset)):
    print('Iteration {}'.format(e))
    if (len(dataset[e]['article']) > 0):
        inference(dataset[e]['article'])

Things I've tried

  • Instantiating a fresh model in each pass. Specifically, I changed the inference() function like this:
def inference(document):
    # Split the document into sentences
    sentences = nltk.sent_tokenize(document)
    
    tokenized_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to('cuda')
    model = transformers.AutoModel.from_pretrained("sentence-transformers/all-distilroberta-v1")
    model.cuda()
    with torch.no_grad():
        model(**tokenized_sentences)

...but the script kept hitting the exact same error, and memory usage didn't change at all, which I found really surprising.

  • Calling torch.cuda.empty_cache() after each iteration (roughly as in the sketch below). I thought the behaviour described above might be related to cache management, but the call had no effect, so I doubt that's the cause.
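
For reference, the loop looked roughly like this (a sketch; the exact placement of the call at the end of the loop body is an assumption):

for e in range(len(dataset)):
    print('Iteration {}'.format(e))
    if (len(dataset[e]['article']) > 0):
        inference(dataset[e]['article'])
    # Release cached, unused memory blocks back to the driver after each document
    torch.cuda.empty_cache()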

  • Checking the size of the example that triggers the error. I added a print(tokenized_sentences['input_ids'].size()) call (see the sketch below) to check whether the document was abnormally large (which shouldn't happen anyway, since truncation is enabled). But its size was similar to other examples.
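
The check looked roughly like this (a sketch; the print is the only addition to the original inference()):

def inference(document):
    # Split the document into sentences
    sentences = nltk.sent_tokenize(document)

    tokenized_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to('cuda')
    # Batch shape: (number of sentences, padded sequence length)
    print(tokenized_sentences['input_ids'].size())
    with torch.no_grad():
        model(**tokenized_sentences)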

I'm out of ideas at this point and not sure what else could be causing the error. Running on CPU instead of CUDA avoids the problem, of course, but makes inference painfully slow.
