Pickle TfidfVectorizer along with a custom tokenizer

Question

I'm passing a custom tokenizer to TfidfVectorizer. That tokenizer depends on an external class, TermExtractor, which lives in another file.

I basically want to build a TfidfVectorizer based on certain terms rather than on all single words/tokens.

Here is the code:

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from TermExtractor import TermExtractor

extractor = TermExtractor()


def tokenize_terms(text):
    # Join the parts of each extracted term with underscores so it becomes one token
    terms = extractor.extract(text)
    return ['_'.join(t) for t in terms]


def main():
    # stop_words and corpus are defined elsewhere in the script
    vectorizer = TfidfVectorizer(lowercase=True, min_df=2, norm='l2', smooth_idf=True,
                                 stop_words=stop_words, tokenizer=tokenize_terms)
    vectorizer.fit(corpus)
    with open("models/terms_vectorizer", "wb") as f:
        pickle.dump(vectorizer, f)

This runs fine, but whenever I want to re-use this TfidfVectorizer and load it with pickle, I get an error:

vectorizer = pickle.load(open("models/terms_vectorizer", "rb"))

Traceback (most recent call last):
  File "./train-nps-comments-classifier.py", line 427, in <module>
    main()
  File "./train-nps-comments-classifier.py", line 325, in main
    vectorizer = pickle.load(open("models/terms_vectorizer", "rb"))
  File "/usr/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python2.7/pickle.py", line 1126, in find_class
    klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'tokenize_terms'

How does Python's pickle work when the pickled object depends on other functions and classes?

Just figured it out: I need to define the function tokenize_terms() in the same script that loads the pickled TfidfVectorizer, import TermExtractor, and create an extractor: extractor = TermExtractor() - David Batista, Feb 04 '16 at 13:16

score 3 · Accepted Answer · answered Aug 05 '18 at 19:50

Just figured it out: I need to define the function tokenize_terms() in the same script that loads the pickled TfidfVectorizer, import TermExtractor, and create an extractor:

extractor = TermExtractor()
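
The reason this works is that pickle stores a module-level function as a reference (module name plus function name), not as code, so the loader must be able to find the same name. The following is a minimal sketch of that behaviour using only the standard library and Python 3. The module name `trainer` and the whitespace-splitting tokenizer are hypothetical stand-ins for the training script and for the real tokenize_terms(); no TfidfVectorizer or TermExtractor is involved.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the training script's module (assumption, not the real script).
trainer = types.ModuleType("trainer")
sys.modules["trainer"] = trainer

# Define a simplified tokenize_terms inside that module; the real one calls extractor.extract().
exec("def tokenize_terms(text):\n    return text.split()\n", trainer.__dict__)

# Pickle writes only the module and function names, not the function body.
payload = pickle.dumps(trainer.tokenize_terms)
name_in_stream = b"tokenize_terms" in payload

# A loader whose module lacks the function fails with an AttributeError,
# much like the traceback in the question.
saved = trainer.__dict__.pop("tokenize_terms")
error = None
try:
    pickle.loads(payload)
except AttributeError as exc:
    error = exc

# Once the name exists again in that module, the same bytes load fine.
trainer.tokenize_terms = saved
restored = pickle.loads(payload)
```

In the question both scripts run as `__main__`, so the reference is to `__main__.tokenize_terms`, which is why redefining the function in the loading script is enough. Keeping the tokenizer in a module that both scripts import would likely avoid the duplication, though that is a suggestion rather than something the original post tested.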

David Batista

score 0 · Answer 2 · answered Aug 21 '23 at 23:24

Also, you can try the dill library. It is a drop-in replacement that extends pickle and supports serializing many more object types.

Jayanga Jayathilake
