The issue
I am trying to run inference with a sentence-transformers model on all rows of the `scientific_papers/pubmed` dataset.
After 177 iterations of the attached code, I get the following error:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB (GPU 0; 8.00 GiB total capacity; 4.92 GiB already allocated; 1.31 GiB free; 4.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
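(For reference, the allocator hint at the end of the error is, as far as I understand, set through an environment variable before the first CUDA allocation. A minimal sketch of what I believe that would look like is below; the value 128 is only a placeholder, and I have not tried this yet.)

```python
import os

# Hypothetical allocator setting mentioned in the error message.
# Must be set before the first CUDA allocation; 128 is just a placeholder value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
```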
The code
```python
import transformers
import datasets
import torch
import nltk

dataset = datasets.load_dataset('scientific_papers', 'pubmed', split='train').shuffle(seed=1)
tokenizer = transformers.RobertaTokenizerFast.from_pretrained("sentence-transformers/all-distilroberta-v1")
model = transformers.AutoModel.from_pretrained("sentence-transformers/all-distilroberta-v1")
model.cuda()

def inference(document):
    # Split the document into sentences
    sentences = nltk.sent_tokenize(document)
    tokenized_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to('cuda')
    with torch.no_grad():
        model(**tokenized_sentences)

for e in range(len(dataset)):
    print('Iteration {}'.format(e))
    if (len(dataset[e]['article']) > 0):
        inference(dataset[e]['article'])
```
Things I've tried
- Instantiating a fresh model in each pass. Specifically, I changed the `inference()` function like this:
```python
def inference(document):
    # Split the document into sentences
    sentences = nltk.sent_tokenize(document)
    tokenized_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to('cuda')
    model = transformers.AutoModel.from_pretrained("sentence-transformers/all-distilroberta-v1")
    model.cuda()
    with torch.no_grad():
        model(**tokenized_sentences)
```
...but the script kept running into the exact same error (memory usage didn't change). This was really striking.
- Calling `torch.cuda.empty_cache()` after each iteration. I thought the behaviour described above might be related to cache management issues, but this line had no effect, so I doubt it (see the sketch after this list for where I added the call).
- Checking the size of the example where I get the error. I added a call to `print(tokenized_sentences['input_ids'].size())` to check whether the document was abnormally large (which shouldn't happen anyway, because I have enabled truncation), but the size was similar to other examples.
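For the `empty_cache()` attempt mentioned above, this is roughly what the main loop looked like (a minimal sketch, with the call placed at the end of each iteration):

```python
for e in range(len(dataset)):
    print('Iteration {}'.format(e))
    if len(dataset[e]['article']) > 0:
        inference(dataset[e]['article'])
    # Attempted fix: release cached allocator blocks after every iteration.
    # This did not change memory usage or prevent the OOM error.
    torch.cuda.empty_cache()
```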
I'm out of ideas at this point and not sure what else could be causing the error. Of course, running on CPU instead of CUDA avoids the problem, but it makes inference painfully slow.