
I have a question related to gensim. I would like to know whether it is recommended or necessary to use pickle when saving or loading a model (or multiple models), as I find scripts on GitHub that do either.

mymodel = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
mymodel.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

See here

Variant 1:

import pickle
# Save
mymodel.save("mymodel.pkl")  # Stores *.pkl file
# Load
mymodel = pickle.load("mymodel.pkl")

Variant 2:

# Save
model.save(mymodel) # Stores *.model file
# Load
model = Doc2Vec.load(mymodel)

In gensim.utils, it appears to me that there is a pickle function embedded: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py

def save ...
    try:
        _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
    ...

Goal of my question: I would be glad to learn 1) whether I need pickle (for better memory management) and 2) if so, why it is better than loading *.model files.

Thank you!

Johnny Kay
Christopher
    In variant 1, loading a saved Doc2Vec model with `pickle.load` fails for me because files stored with a model's `save` method do not have a readline attribute. Does it actually work for you? In variant 2, I assume that a) `mymodel` is a path to a file with the extension "model" and not the Doc2Vec model created earlier and b) `model` is a Doc2Vec model. Are these assumptions correct? – WolfgangK Jun 11 '18 at 06:01

2 Answers


Whenever you store a model using gensim's built-in save() function, pickle is used regardless of the file extension. The documentation for gensim.utils tells us this:

class gensim.utils.SaveLoad

Bases: object

Class which inherit from this class have save/load functions, which un/pickle them to disk.

Warning

This uses pickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions etc.

So gensim will use pickle to save any model as long as the model class inherits from gensim.utils.SaveLoad. In your case, gensim.models.doc2vec.Doc2Vec inherits from gensim.models.base_any2vec.BaseWordEmbeddingsModel, which in turn inherits from gensim.utils.SaveLoad, which provides the actual save() function.
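The inheritance chain described above can be illustrated with a small sketch. This is not gensim's code: `SaveLoadSketch`, `BaseModelSketch` and `ToyModel` are hypothetical stand-ins, and for simplicity the sketch pickles only the instance's attribute dictionary rather than the whole object.

```python
import os
import pickle
import tempfile


class SaveLoadSketch:
    """Hypothetical stand-in for a SaveLoad-style base class (not gensim's code)."""

    def save(self, fname, pickle_protocol=2):
        # Simplification: pickle only the instance attributes.
        with open(fname, "wb") as handle:
            pickle.dump(self.__dict__, handle, protocol=pickle_protocol)

    @classmethod
    def load(cls, fname):
        # Unpickle the attributes and rebuild an instance of the calling class.
        with open(fname, "rb") as handle:
            state = pickle.load(handle)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj


class BaseModelSketch(SaveLoadSketch):
    """Hypothetical intermediate base class."""


class ToyModel(BaseModelSketch):
    """Hypothetical model; it defines no save() or load() of its own."""

    def __init__(self, vector_size):
        self.vector_size = vector_size


path = os.path.join(tempfile.mkdtemp(), "toy.model")
ToyModel(vector_size=100).save(path)   # save() comes from SaveLoadSketch
restored = ToyModel.load(path)         # load() comes from SaveLoadSketch

# ToyModel itself does not define save(); it is found further up the hierarchy.
defines_own_save = "save" in vars(ToyModel)
```

The model class never mentions pickle, yet its files are pickles, because the behaviour is inherited from the base class.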

To answer your questions:

  1. Yes, you need pickle unless you want to write your own function for storing your models to disk. Using pickle should not be problematic though since it is in the standard library. You won't even notice it.
  2. If you use the gensim save() function, you can choose any file extension: *.model, *.pkl, *.p, *.pickle. The saved file will be pickled.
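To make the second point concrete, here is a minimal sketch using only the standard library. The dictionary `toy_model` is a hypothetical stand-in for a trained model. It shows that the extension has no effect on what pickle writes, and that `pickle.load` expects an open binary file rather than a path string, which is the likely reason Variant 1 from the question fails.

```python
import os
import pickle
import tempfile

# Plain dict standing in for a trained model (hypothetical data).
toy_model = {"vector_size": 100, "window": 8}

tmp_dir = tempfile.mkdtemp()
loaded = {}
for extension in (".model", ".pkl", ".p", ".pickle"):
    path = os.path.join(tmp_dir, "mymodel" + extension)
    # The file name is only a label; the content is a pickle either way.
    with open(path, "wb") as handle:
        pickle.dump(toy_model, handle)
    with open(path, "rb") as handle:
        loaded[extension] = pickle.load(handle)

# Passing a path string instead of an open file object raises an error
# (a TypeError in CPython 3, since a str has no read/readline methods).
try:
    pickle.load(os.path.join(tmp_dir, "mymodel.pkl"))
    path_load_failed = False
except (TypeError, AttributeError):
    path_load_failed = True
```

With gensim itself, the counterpart of save() is the class's own load() method, as in Variant 2 of the question.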
WolfgangK

It depends on your requirements.

If you are going to use the data only with Python and do not need to switch between Python versions (I ran into some problems porting pickled models from Python 2 to Python 3), a binary format will be a good choice.

If you want interoperability, or the model might be used in other projects or by other programmers, I would use gensim's save method.
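On the Python version issue: the snippet quoted in the question passes a `pickle_protocol` argument through to pickle, so the protocol version is one knob that affects which interpreters can read a file. The sketch below uses only the standard library with a hypothetical `toy_model` dictionary. Choosing protocol 2 may help with Python 2 readers, but it does not guarantee that a pickled model ports cleanly.

```python
import pickle

# Plain dict standing in for a trained model (hypothetical data).
toy_model = {"vector_size": 100, "min_count": 5}

# Protocol 2 is the newest protocol that Python 2 can read.
portable_blob = pickle.dumps(toy_model, protocol=2)
# The highest protocol depends on the running Python 3 version.
newest_blob = pickle.dumps(toy_model, protocol=pickle.HIGHEST_PROTOCOL)

# Pickles written with protocol 2 or later start with the byte 0x80
# followed by the protocol number.
portable_protocol = portable_blob[1]
newest_protocol = newest_blob[1]

restored = pickle.loads(portable_blob)
```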

l.augustyniak