
There is a memory leak when using pipe of the en_core_web_trf model. I run the model on a GPU with 16GB of RAM; here is a sample of the code.

!python -m spacy download en_core_web_trf

import en_core_web_trf
nlp = en_core_web_trf.load()

# it's just an array of 100K sentences.
data = dataload()

for index, review in enumerate(nlp.pipe(data, batch_size=100)):
    # doing some processing here
    if index % 1000 == 0:
        print(index)

This code crashes when it reaches about 31K sentences and raises an OOM error:

CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 11.17 GiB total capacity; 10.44 GiB already allocated; 832.00 KiB free; 10.72 GiB reserved in total by PyTorch)

I only use the pipeline to predict (I am not training anything), and I have tried different batch sizes, but nothing changed; it still crashes.

Your Environment

  • spaCy version: 3.0.5
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • Pipelines: en_core_web_trf (3.0.0)
moro clash

1 Answer


Lucky you with a GPU - I am still trying to get through the (torch GPU) DLL hell on Windows :-). But it looks like spaCy 3 uses more GPU memory than spaCy 2 did - my 6GB GPU may have become useless.

That said, have you tried running your case without the GPU (and watching memory usage)?
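For reference, here is a rough CPU-only sketch (psutil and the 1000-row reporting interval are my own additions, and data is the same list as in your question). Without calling spacy.require_gpu() the pipeline stays on the CPU, so you can watch resident memory grow:

import psutil
import spacy

nlp = spacy.load("en_core_web_trf")  # no spacy.require_gpu(), so everything runs on CPU
proc = psutil.Process()              # the current Python process

for index, doc in enumerate(nlp.pipe(data, batch_size=100)):
    # same processing as in your loop
    if index % 1000 == 0:
        rss_mb = proc.memory_info().rss / 1024 ** 2
        print(f"row {index}: resident memory {rss_mb:.0f} MiB")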

The spaCy 2 'leak' on large datasets is (mainly) due to the growing vocabulary - each data row may add a couple more words, and the suggested 'solution' is reloading the model and/or just the vocabulary every nnn rows. The GPU usage may have the same issue...
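A rough sketch of that reload-every-nnn-rows workaround, adapted to your loop (the 10,000-row chunk size and the helper name are arbitrary choices of mine, not something from the spaCy docs):

import spacy

def pipe_in_chunks(data, chunk_size=10_000, batch_size=100):
    nlp = spacy.load("en_core_web_trf")
    for start in range(0, len(data), chunk_size):
        for doc in nlp.pipe(data[start:start + chunk_size], batch_size=batch_size):
            yield doc
        # drop the pipeline (and whatever it has accumulated) and load a fresh copy
        del nlp
        nlp = spacy.load("en_core_web_trf")

for index, review in enumerate(pipe_in_chunks(data)):
    # same processing as before
    if index % 1000 == 0:
        print(index)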

mbrunecky