I am seeing behaviour I cannot explain: I want to measure the memory footprint of a spaCy model inside a Python process.

The question I am trying to answer is how many gunicorn workers I can spin up, given that each worker has to load its own spaCy model, the total amount of memory available on the Kubernetes node, and the maximum number of replicas the service can scale to.
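
Roughly, the back-of-the-envelope calculation I have in mind looks like the sketch below, where every number is a placeholder until I know the real per-worker footprint:

# All values are placeholders; the model footprint is what I am trying to measure below.
MODEL_RSS_MB = 750          # memory a single loaded spaCy model appears to need
WORKER_OVERHEAD_MB = 50     # assumed interpreter + gunicorn overhead per worker
NODE_MEMORY_MB = 16 * 1024  # assumed memory available on the Kubernetes node
MAX_REPLICAS = 4            # assumed max replicas the service can scale to

memory_per_replica_mb = NODE_MEMORY_MB / MAX_REPLICAS
workers_per_replica = int(memory_per_replica_mb // (MODEL_RSS_MB + WORKER_OVERHEAD_MB))
print(f"~{workers_per_replica} gunicorn workers per replica")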

To do so, I have the following script that

  • loads a spaCy model using spacy.load(),
  • checks memory usage using psutil.

I make sure to run pkill -f python before running the script so that I start from a clean state.

import psutil, os, time, sys

import spacy

B_TO_MB_CONVERSION_UNIT = 0.00000095367432  # bytes -> MiB, i.e. 1 / 1024 ** 2

if __name__ == "__main__":
    process = psutil.Process(os.getpid())
    nlp = spacy.load(
        "en_core_web_lg",
        exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
    )
    print(f"Loaded components: {nlp.pipe_names}")
    start = time.time()
    # sample memory about once per second for roughly ten seconds
    while True:
        print(process.memory_info())
        mem_usage = process.memory_info().rss * B_TO_MB_CONVERSION_UNIT
        print(f"Process takes {mem_usage} MB")
        now = time.time()
        if now - start > 10:
            sys.exit(0)

        time.sleep(1)

On the first run, I get this output:

Loaded components: ['ner']
pmem(rss=784879616, vms=5812686848, pfaults=305254, pageins=672)
Process takes 748.5195340706612 MB
pmem(rss=784883712, vms=5812686848, pfaults=305255, pageins=672)
Process takes 748.523440320676 MB
pmem(rss=784883712, vms=5812686848, pfaults=305255, pageins=672)
Process takes 748.523440320676 MB
pmem(rss=784883712, vms=5812686848, pfaults=305255, pageins=672)
Process takes 748.523440320676 MB
[...]

but on another run I get this:

Loaded components: ['ner']
pmem(rss=1125773312, vms=5842046976, pfaults=303825, pageins=504)
Process takes 1073.621097795748 MB
pmem(rss=1125777408, vms=5842046976, pfaults=303826, pageins=504)
Process takes 1073.6250040457626 MB
pmem(rss=1125777408, vms=5842046976, pfaults=303826, pageins=504)
Process takes 1073.6250040457626 MB
pmem(rss=1125777408, vms=5842046976, pfaults=303826, pageins=504)
Process takes 1073.6250040457626 MB

And on yet another run, I get this:

Loaded components: ['ner']
pmem(rss=909500416, vms=5858304000, pfaults=316778, pageins=3363)
Process takes 867.3984407686349 MB
pmem(rss=909541376, vms=5858304000, pfaults=316793, pageins=3364)
Process takes 867.4062532686644 MB
pmem(rss=909541376, vms=5858304000, pfaults=316793, pageins=3364)
Process takes 867.4062532686644 MB
pmem(rss=909541376, vms=5858304000, pfaults=316793, pageins=3364)
Process takes 867.4062532686644 MB

What am I missing or doing wrong here? I honestly thought that the same model would take the same amount of memory, give or take.

Mattia Paterna

  • I think this has to do with RSS being a little wonky and depending on other processes and OS RAM magic. If you aren't making any documents, a spaCy model will load the same data each time. – polm23 Aug 17 '21 at 10:00
  • OK, that is interesting. I only do what the code shows. I actually measure memory this way as suggested from [this issue](https://github.com/explosion/spaCy/issues/4432#issue-505823125). Do you have any alternative way of doing the same that you would suggest or recommend? – Mattia Paterna Aug 17 '21 at 10:02
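
A minimal sketch of one alternative, measuring the RSS delta around spacy.load() instead of the absolute RSS, using the same model and exclude list as the question (untested and illustrative; the delta still depends on what the OS shares between processes):

import os

import psutil
import spacy

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss  # RSS of the bare interpreter plus imports
nlp = spacy.load(
    "en_core_web_lg",
    exclude=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
)
rss_after = process.memory_info().rss   # RSS once the model is loaded
print(f"spacy.load() added ~{(rss_after - rss_before) / 1024 ** 2:.1f} MiB to RSS")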
