A TF-IDF matrix has the shape (number_of_documents, number_of_unique_words), i.e. each document gets one feature per word in the vocabulary. This can get very large for big datasets.
In your case:
(100000 (docs) * 4000 (words) * 8 (np.float64 bytes)) / 1024**3 ~ 3 GB
Moreover, scikit-learn's TfidfVectorizer by default returns a sparse matrix (scipy.sparse.csr_matrix) to compensate for this. Even for long documents the matrix contains mostly zeros, so it usually takes an order of magnitude less memory than the dense size. If I am correct, your matrix should be well below that 3 GB estimate.
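You can check the real footprint of the sparse matrix directly from its three underlying arrays (a quick sketch, assuming tf_idf_matrix is the output of fit_transform):
sparse_bytes = (tf_idf_matrix.data.nbytes
                + tf_idf_matrix.indices.nbytes
                + tf_idf_matrix.indptr.nbytes)  # CSR stores the non-zero values plus two index arrays
print(sparse_bytes / 1024**3, "GB")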
Hence the question: do you really have only 4000 words in your model (controlled by TfidfVectorizer(max_features=4000))?
If you don't care about individual word frequencies, you can reduce the vector size with PCA or other dimensionality-reduction techniques:
from sklearn.decomposition import PCA
# PCA needs a dense array, so convert the sparse TF-IDF matrix first
dense_matrix = tf_idf_matrix.toarray()
components_number = 300
reduced_data = PCA(n_components=components_number).fit_transform(dense_matrix)
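Note that PCA requires a dense input, and densifying a 100000 x 4000 matrix costs the full ~3 GB. A common alternative (a sketch, using the same hypothetical tf_idf_matrix) is TruncatedSVD, which works on the sparse matrix directly:
from sklearn.decomposition import TruncatedSVD
# TruncatedSVD accepts the scipy.sparse matrix as-is, so the dense
# intermediate matrix is never materialized.
reduced_data = TruncatedSVD(n_components=300).fit_transform(tf_idf_matrix)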
Or you can use something like doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
With it you get a matrix of shape (number_of_documents, embedding_size), where the embedding size is usually in the range of 100 to 600. You can train a doc2vec model without storing individual word vectors by leaving the dbow_words parameter at 0 (its default).
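A minimal training sketch with gensim (the texts list is a placeholder for your documents; attribute and parameter names follow gensim 4.x):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["first example document", "second example document"]  # your raw documents
documents = [TaggedDocument(words=text.lower().split(), tags=[i])
             for i, text in enumerate(texts)]

# dm=0 selects DBOW mode; dbow_words=0 skips separate word vectors,
# which keeps the model small.
model = Doc2Vec(documents, vector_size=300, dm=0, dbow_words=0,
                min_count=2, epochs=20, workers=4)

doc_vectors = model.dv  # one 300-dimensional vector per tag (model.docvecs in gensim < 4.0)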
If you care about individual word features, the only reasonable solution I see is to decrease the number of words, i.e. the vocabulary size.
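The usual knobs for that are TfidfVectorizer's own parameters (the thresholds below are only examples):
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=4000,    # keep only the 4000 most frequent terms
    min_df=5,             # drop terms appearing in fewer than 5 documents
    max_df=0.5,           # drop terms appearing in more than half of the documents
    stop_words="english", # drop common English stop words
)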
Relevant stackoverflow posts:
----On dimensionality reduction
How do i visualize data points of tf-idf vectors for kmeans clustering?
----On using generators to train TFIDF
Sklearn TFIDF on large corpus of documents
How to get tf-idf matrix of a large size corpus, where features are pre-specified?
tf-idf on a somewhat large (65k) amount of text files
The model itself should not occupy that much space. I suppose this is only possible if you have some heavy objects in the TfidfVectorizer's tokenizer or preprocessor attributes.
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer:
    def __init__(self):
        # a heavy attribute that gets pickled together with the vectorizer
        self.s = np.random.uniform(0, 1, size=(10000, 10000))

    def tokenizer(self, text):
        text = text.lower().split()
        return text

tokenizer = Tokenizer()
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenizer)
pickle.dump(vectorizer, open("vectorizer.pcl", "wb"))
This will occupy more than 700 MB after pickling, because the bound tokenizer method drags the whole Tokenizer instance, including its 10000 x 10000 array (~800 MB of raw data), into the pickle.
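For comparison, the same vectorizer pickled with a plain, stateless tokenizer function stays tiny (a sketch):
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # no heavy state is captured; only a reference to the function is pickled
    return text.lower().split()

vectorizer = TfidfVectorizer(tokenizer=tokenize)
pickle.dump(vectorizer, open("vectorizer_light.pcl", "wb"))  # a few KB instead of 700+ MB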