
I have a list of tokenized sentences and would like to fit a TfidfVectorizer. I tried the following:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
  return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')    
tfidf.fit_transform(tokenized_list_of_sentences)

which errors out as

AttributeError: 'list' object has no attribute 'lower'

Is there a way to do this? I have a billion sentences and do not want to tokenize them again. They were already tokenized in an earlier stage of the pipeline.

greenberet123

4 Answers


Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming that is actually what you want, since you lowercased your tokens in previous stages).

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)    
tfidf.fit_transform(tokenized_list_of_sentences)

Note that I changed the example sentences, because the originals contained only stop words, which caused another error due to an empty vocabulary.
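As a quick sanity check (my own addition, assuming the snippet above has just been run), the fitted vectorizer should now report only the non-stop-word tokens, with their original casing preserved:

# feature names after fitting with lowercase=False (assumes the tfidf object from above)
print(tfidf.get_feature_names())
# ['basketball', 'football']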

Ufos
  • 3,083
pmlk
  • Any idea how I can save and load the TfidfVectorizer object if I'm using an external function such as the one in this example? I'm getting errors while trying to load it. – Lior Magen Jan 27 '20 at 14:35

Try preprocessor instead of tokenizer. The traceback points at the default preprocessor:

    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

The x in that line is one of your token lists, and calling .lower() on a list throws this error.
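To see where that .lower() call comes from, here is a rough illustration using scikit-learn's public build_preprocessor method (this sketch is my addition, not part of the original answer):

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
preprocess = vec.build_preprocessor()  # the default preprocessor just lowercases a string
preprocess("Hello World")              # 'hello world'
# preprocess(['hello', 'world'])       # AttributeError: 'list' object has no attribute 'lower'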

Your two example sentences consist entirely of stop words, so to make this return something, throw in a few extra words. Here's an example:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

Returns:

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

Features:

>>> tfidf.get_feature_names()
['cat', 'dog']
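Why this works: a custom preprocessor replaces the default one (including lowercasing), so ' '.join turns each token list back into a single string, which the default tokenizer then re-splits and filters against the stop-word list. A small sketch using the public build_analyzer method (my illustration, assuming the tokens are already lowercase):

vec = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
analyze = vec.build_analyzer()
# the list is joined to 'this is one cat or dog', re-tokenized by the default
# token pattern, and the English stop words are dropped
analyze(['this', 'is', 'one', 'cat', 'or', 'dog'])  # ['cat', 'dog']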

UPDATE: maybe use lambdas for both tokenizer and preprocessor?

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> tfidf.get_feature_names()
['cat', 'dog']
Jarad

As @Jarad said, just use a "passthrough" function as your analyzer, but it needs to ignore stop words. You can get stop words from sklearn:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

or from nltk:

>>> import nltk
>>> nltk.download('stopwords')
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))

or combine both sets:

stop_words = stop_words.union(ENGLISH_STOP_WORDS)

But then your examples contain only stop words (because all of your words are in sklearn's ENGLISH_STOP_WORDS set).

Nonetheless, @Jarad's examples work:

>>> tokenized_list_of_sentences =  [
...     ['this', 'is', 'one', 'cat', 'or', 'dog'],
...     ['this', 'is', 'another', 'dog']]
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
>>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)

I like pd.DataFrames for browsing TF-IDF vectors:

>>> import pandas as pd
>>> pd.DataFrame(tfidf_vectors.todense(), columns=tfidf.get_feature_names())
        cat       dog 
0  0.814802  0.579739
1  0.000000  1.000000
hobs

The answers above make perfect sense, but the hardest part comes when you start serializing and de-serializing the model. The solution proposed by @pmlk will give you an error if you serialize the model with joblib.dump(tfidf, 'tfidf.joblib') and then try to load it:

tfidf = joblib.load('tfidf.joblib')

    AttributeError                            Traceback (most recent call last)
    <ipython-input-3-9093b9496059> in <module>()
    ----> 1 tfidf = load('tfidf.joblib')
    <...>
    AttributeError: module '__main__' has no attribute 'identity_tokenizer'

So, as @Jarad mentioned, it's better to use a lambda function as the tokenizer and the preprocessor. Lambdas can be serialized with dill (though not with standard pickle/joblib):

import dill

tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

with open('tfidf.dill', 'wb') as f:
    dill.dump(tfidf, f)

And then you can load the model without any issues:

with open('tfidf.dill', 'rb') as f:
    q = dill.load(f)

In most cases, it is safer to serialize only the vocabulary rather than the whole model - but if for some reason you cannot do that (like me), lambdas and dill might be the solution. Cheers!
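For completeness, a minimal sketch of the vocabulary-only route (my own illustration, not from the original answer): vocabulary_ is a plain dict mapping each term to its column index, so it can be dumped as JSON once its values are cast to plain int. Note that the idf weights are not saved this way, so the rebuilt vectorizer still has to be fitted before it can transform:

import json

# save only the learned vocabulary (cast to plain int so json can serialize the values)
with open('tfidf_vocab.json', 'w') as f:
    json.dump({term: int(idx) for term, idx in tfidf.vocabulary_.items()}, f)

# later: rebuild a vectorizer with the fixed vocabulary and re-fit to recover idf weights
with open('tfidf_vocab.json') as f:
    vocab = json.load(f)

tfidf2 = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x,
                         stop_words='english', vocabulary=vocab)
tfidf2.fit_transform(tokenized_sentences)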

Amir