
I have a list of tokenized sentences and would like to fit a TfidfVectorizer. I tried the following:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
  return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')    
tfidf.fit_transform(tokenized_list_of_sentences)

which errors out as

AttributeError: 'list' object has no attribute 'lower'

Is there a way to do this? I have a billion sentences and do not want to tokenize them again. They were already tokenized in an earlier stage of the pipeline.

greenberet123

4 Answers


Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming that is actually what you want, since you lowercased your tokens in previous stages).

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)    
tfidf.fit_transform(tokenized_list_of_sentences)

Note that I changed the example sentences, because the originals contained only stop words, which caused another error due to an empty vocabulary.
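As a quick sanity check (my own addition, assuming the snippet above has just been run), the fitted vectorizer should now report only the non-stop-word tokens, with their original casing preserved:

# feature names after fitting with lowercase=False (assumes the tfidf object from above)
print(tfidf.get_feature_names())
# ['basketball', 'football']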

Ufos
  • 3,083
pmlk
  • Any idea how I can save and load the TfidfVectorizer object if I'm using an external function such as the one in this example? I'm getting errors while trying to load it. – Lior Magen Jan 27 '20 at 14:35

Try preprocessor instead of tokenizer. The traceback points at the default preprocessor:

    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

The x in that line is one of your token lists, and calling .lower() on a list throws this error.
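To see where that .lower() call comes from, here is a rough illustration using scikit-learn's public build_preprocessor method (this sketch is my addition, not part of the original answer):

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
preprocess = vec.build_preprocessor()  # the default preprocessor just lowercases a string
preprocess("Hello World")              # 'hello world'
# preprocess(['hello', 'world'])       # AttributeError: 'list' object has no attribute 'lower'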

Your two example sentences consist entirely of stop words, so to make this return something, throw in a few extra words. Here's an example:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

Returns:

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

Features:

>>> tfidf.get_feature_names()
['cat', 'dog']
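Why this works: a custom preprocessor replaces the default one (including lowercasing), so ' '.join turns each token list back into a single string, which the default tokenizer then re-splits and filters against the stop-word list. A small sketch using the public build_analyzer method (my illustration, assuming the tokens are already lowercase):

vec = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
analyze = vec.build_analyzer()
# the list is joined to 'this is one cat or dog', re-tokenized by the default
# token pattern, and the English stop words are dropped
analyze(['this', 'is', 'one', 'cat', 'or', 'dog'])  # ['cat', 'dog']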

UPDATE: maybe use lambdas for both tokenizer and preprocessor?

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> tfidf.get_feature_names()
['cat', 'dog']
Jarad

As @Jarad said, just use a "passthrough" function as your analyzer, but it needs to ignore stop words. You can get stop words from sklearn:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

or from nltk:

>>> import nltk
>>> nltk.download('stopwords')
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))

or combine both sets:

stop_words = stop_words.union(ENGLISH_STOP_WORDS)

But then your examples contain only stop words (because all of your words are in sklearn's ENGLISH_STOP_WORDS set).

Nonetheless, @Jarad's examples work:

>>> tokenized_list_of_sentences =  [
...     ['this', 'is', 'one', 'cat', 'or', 'dog'],
...     ['this', 'is', 'another', 'dog']]
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
>>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)

I like pd.DataFrames for browsing TF-IDF vectors:

>>> import pandas as pd
>>> pd.DataFrame(tfidf_vectors.todense(), columns=tfidf.get_feature_names())
        cat       dog 
0  0.814802  0.579739
1  0.000000  1.000000
hobs

The answers above make perfect sense, but the hardest part comes when you start serializing and de-serializing the model. The solution proposed by @pmlk will give you an error if you serialize the model with joblib.dump(tfidf, 'tfidf.joblib') and then try to load it:

tfidf = joblib.load('tfidf.joblib')

    AttributeError                            Traceback (most recent call last)
    <ipython-input-3-9093b9496059> in <module>()
    ----> 1 tfidf = load('tfidf.joblib')
    <...>
    AttributeError: module '__main__' has no attribute 'identity_tokenizer'

So, as @Jarad mentioned, it's better to use a lambda function as the tokenizer and the preprocessor. Lambdas can be serialized with dill (though not with standard pickle/joblib):

import dill

tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

with open('tfidf.dill', 'wb') as f:
    dill.dump(tfidf, f)

And then you can load the model without any issues:

with open('tfidf.dill', 'rb') as f:
    q = dill.load(f)

In most cases, it is safer to serialize only the vocabulary rather than the whole model - but if for some reason you cannot do that (like me), lambdas and dill might be the solution. Cheers!
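For completeness, a minimal sketch of the vocabulary-only route (my own illustration, not from the original answer): vocabulary_ is a plain dict mapping each term to its column index, so it can be dumped as JSON once its values are cast to plain int. Note that the idf weights are not saved this way, so the rebuilt vectorizer still has to be fitted before it can transform:

import json

# save only the learned vocabulary (cast to plain int so json can serialize the values)
with open('tfidf_vocab.json', 'w') as f:
    json.dump({term: int(idx) for term, idx in tfidf.vocabulary_.items()}, f)

# later: rebuild a vectorizer with the fixed vocabulary and re-fit to recover idf weights
with open('tfidf_vocab.json') as f:
    vocab = json.load(f)

tfidf2 = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x,
                         stop_words='english', vocabulary=vocab)
tfidf2.fit_transform(tokenized_sentences)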

Amir