
I have roughly 100,000 long articles totaling about 5 GB of text. When I run sklearn's TfidfVectorizer on them, it constructs a model of about 6 GB. How is that possible? Isn't it enough to store the document frequency of those 4000 words and what those 4000 words are? I am guessing TfidfVectorizer stores such a 4000-dimensional vector for every document. Is it possible that I have some setting configured wrongly?

user40780

2 Answers


A TF-IDF matrix has shape (number_of_documents, number_of_unique_words), so for each document you get a feature for every word in the dataset. This can get bloated for large datasets.

In your case (100000 (docs) * 4000 (words) * 8 (np.float64 bytes)) / 1024**3 ≈ 3 GB for a dense matrix.
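
For instance, a quick back-of-the-envelope check (document and vocabulary counts taken from the question; the dtype is assumed to be the default np.float64):

    # Rough dense-matrix size: documents * features * bytes per value
    n_docs, n_features, bytes_per_value = 100_000, 4_000, 8  # np.float64 -> 8 bytes
    print(n_docs * n_features * bytes_per_value / 1024**3)   # ~2.98 GB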

Moreover, sklearn's TfidfVectorizer compensates for this by returning a sparse matrix (scipy.sparse.csr_matrix) by default. Even for long documents the matrix contains mostly zeros, so it usually takes an order of magnitude less space than the dense size. If I am correct, it should end up well below 3 GB.
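
To see how much the sparse representation actually takes, you can sum the buffers of the CSR matrix. A minimal sketch, assuming vectorizer is your TfidfVectorizer and documents is your corpus:

    # A CSR matrix stores three arrays: non-zero values, column indices and row pointers
    tf_idf_matrix = vectorizer.fit_transform(documents)
    size_mb = (tf_idf_matrix.data.nbytes
               + tf_idf_matrix.indices.nbytes
               + tf_idf_matrix.indptr.nbytes) / 1024**2
    print(f"sparse TF-IDF matrix: {size_mb:.1f} MB")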

Hence the question: do you really have only 4000 words in your model (controlled by TfidfVectorizer(max_features=4000))?
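
You can check this on the fitted vectorizer itself (again assuming it is called vectorizer):

    # Number of features the fitted vectorizer actually kept
    print(len(vectorizer.vocabulary_))  # should be 4000 if max_features=4000 is in effect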

If you don't care about individual word frequencies, you can decrease the vector size using PCA or other techniques:

    from sklearn.decomposition import PCA

    # Densify the sparse TF-IDF matrix before applying PCA
    dense_matrix = tf_idf_matrix.toarray()
    components_number = 300
    reduced_data = PCA(n_components=components_number).fit_transform(dense_matrix)
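
As a side note, if densifying the matrix is itself too expensive, TruncatedSVD accepts sparse input directly. A sketch under the same assumptions (tf_idf_matrix being the sparse output of the vectorizer):

    from sklearn.decomposition import TruncatedSVD

    # Works on the scipy sparse matrix directly, no .toarray() needed
    svd = TruncatedSVD(n_components=300)
    reduced_data = svd.fit_transform(tf_idf_matrix)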

Or you can use something like doc2vec. https://radimrehurek.com/gensim/models/doc2vec.html

Using it you'll get a matrix of shape (number_of_documents, embedding_size). The embedding size is usually in the range of 100 to 600. You can train a doc2vec model without storing individual word vectors by using the dbow_words parameter.
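
A minimal gensim sketch, assuming documents is a list of token lists (dm=0 selects the DBOW architecture; with dbow_words=0 no separate word vectors are trained):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tag every document with its index
    corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(documents)]

    # DBOW without word-vector training keeps the model small
    model = Doc2Vec(corpus, vector_size=300, dm=0, dbow_words=0, min_count=5, epochs=10)
    doc_vectors = model.dv.vectors  # shape: (number_of_documents, 300) in gensim 4.x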

If you care about individual word features, the only reasonable solution I see is to decrease the number of words.
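
For example, TfidfVectorizer's own max_features, min_df and max_df parameters prune the vocabulary directly (the values below are illustrative only):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Keep at most 4000 terms, and drop terms that appear in fewer than 5 documents
    # or in more than half of all documents
    vectorizer = TfidfVectorizer(max_features=4000, min_df=5, max_df=0.5)
    tf_idf_matrix = vectorizer.fit_transform(documents)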

Relevant stackoverflow posts:

----On dimensionality reduction

How do i visualize data points of tf-idf vectors for kmeans clustering?

----On using generators to train TFIDF

Sklearn TFIDF on large corpus of documents

How to get tf-idf matrix of a large size corpus, where features are pre-specified?

tf-idf on a somewhat large (65k) amount of text files


The model itself should not occupy so much space. I suppose this is possible only if you have some heavy objects in the TfidfVectorizer tokenizer or preprocessor attributes, for example:

    import pickle

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    class Tokenizer:
        def __init__(self):
            # heavy attribute (10,000 x 10,000 float64s, ~760 MB) that gets
            # pickled together with the vectorizer
            self.s = np.random.uniform(0, 1, size=(10000, 10000))

        def tokenizer(self, text):
            return text.lower().split()

    tokenizer = Tokenizer()
    vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenizer)
    pickle.dump(vectorizer, open("vectorizer.pcl", "wb"))

This will occupy more than 700 MB after pickling.
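
A quick way to see where the space goes is to compare the pickled size of the whole vectorizer with that of its fitted attributes. A sketch, assuming a fitted vectorizer:

    import pickle

    # Pickled size of the whole object vs. its main fitted attributes
    print(len(pickle.dumps(vectorizer)) / 1024**2, "MB total")
    print(len(pickle.dumps(vectorizer.vocabulary_)) / 1024**2, "MB vocabulary_")
    print(len(pickle.dumps(vectorizer.idf_)) / 1024**2, "MB idf_")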

Denis Gordeev
  • Thanks! I still have a question. I actually mean that the model size is large, not the transformed X. Do you mean that the model itself stores the transformed X of all of my input data? Is it possible to make it not store such huge stuff? I generally just take the model and transform new data, so I don't really want to store the transformed X. – user40780 Nov 13 '19 at 16:41
  • Sorry, I misunderstood your question. The TfidfVectorizer instance itself stores only vectorizer.vocabulary_ and vectorizer.idf_: the first is a dictionary mapping terms to feature indices, the second is a numpy array of IDF values. Both of them require <100 MB even for large datasets. Could you provide your code? Does the model take so much space after pickling? The only reason I can think of is that you have something heavy in your custom tokenizer or preprocessor. – Denis Gordeev Nov 14 '19 at 14:29
  • I see, thanks for your reply! My tf-idf uses ngram_range=(1,2) on a huge number of documents (maybe with tons of misspellings). Does it store all possible bigrams, and is that what makes the size so large? – user40780 Nov 14 '19 at 18:41
  • Sorry, I do not quite understand. Do you mean that you calculate tf-idf for bigrams? TfidfVectorizer shouldn't take much space even in this case. Could you provide the code you invoke TfidfVectorizer with? – Denis Gordeev Nov 18 '19 at 12:44

I know there is already an answer, but here is some additional information for others to consider. When you pickle the TfidfVectorizer directly, you also save the stop_words_ attribute of the vectorizer, which is not needed once the vocabulary is established. In one of our models the vocabulary had 3,000 words, yet the saved model occupied 250 MB; inspecting the model, we saw that 10 million stop words were also stored with it. Then we saw the following warning in the TfidfVectorizer documentation:

"The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling."

Applying that reduced our model size significantly.
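
A minimal sketch of that fix, assuming a fitted vectorizer (the stop_words_ attribute and the advice come from the sklearn warning quoted above; the file name is just an example):

    import pickle

    # stop_words_ is only kept for introspection; dropping it before pickling is safe
    vectorizer.stop_words_ = None  # or: delattr(vectorizer, "stop_words_")
    with open("vectorizer.pkl", "wb") as f:
        pickle.dump(vectorizer, f)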

OldWolfs