
I get a MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64:

from sklearn.feature_extraction.text import TfidfVectorizer

# remove_stopwords is a custom analyzer callable defined elsewhere
tfidf = TfidfVectorizer(analyzer=remove_stopwords)

X = tfidf.fit_transform(df['lemmatize'])
print(X.shape)

Output: (50000, 164921)

Now, here comes the memory error:

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names())

MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64

  • You can set parameters on the `TfidfVectorizer` to control memory use, such as `max_features`. – Peacepieceonepiece Sep 15 '21 at 08:29
  • `TfidfVectorizer` isn't pandas, it's [`sklearn.feature_extraction.text.TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Please fix your title and tags. Anyway, it has several settings to limit its memory use, `max_features` being the main one; also `min_df` and `max_df`. – smci Feb 25 '22 at 04:48
  • Also, you need to add the missing import that shows where `TfidfVectorizer` comes from. (Questions are required to contain an MCVE: [mcve]) – smci Feb 25 '22 at 04:58
  • a) **Don't convert the sparse array returned by sklearn into a pandas DataFrame or dense array; that will blow up memory usage: `pd.DataFrame(X.toarray()...)` is asking for trouble.** Duplicate of [Converting TfidfVectorizer sparse matrix to dataframe or dense array results in memory error](https://stackoverflow.com/questions/48886671/converting-tfidfvectorizer-sparse-matrix-to-array-results-in-memory-error). b) In any case, never run `TfidfVectorizer` without `max_features`. – smci Feb 25 '22 at 05:09
  • Tell us what downstream NLP tasks you're trying to perform on the `TfidfVectorizer` output X. What are your next 20 lines of code? Then, **figure out how to implement that natively on the sparse array**. – smci Feb 25 '22 at 05:11
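
Putting the comments above together, a minimal sketch of the vocabulary-limiting settings (the `df['lemmatize']` column comes from the question; the specific limit values here are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

# Cap the vocabulary so the matrix stays a manageable width:
# max_features keeps only the 20,000 highest-frequency terms,
# min_df drops terms appearing in fewer than 5 documents,
# max_df drops terms appearing in more than 50% of documents.
tfidf = TfidfVectorizer(max_features=20000, min_df=5, max_df=0.5)
X = tfidf.fit_transform(df['lemmatize'])  # X is a scipy.sparse matrix

# Most sklearn estimators accept the sparse matrix directly,
# e.g. LogisticRegression().fit(X, y), so X.toarray() is rarely needed.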

2 Answers


You are running low on memory, but you may be able to get it done by reducing the datatype from float64 to a smaller one such as uint8. Try this and see whether it throws the same error again.

import numpy as np
import pandas as pd

# Cast the sparse matrix before densifying, so the dense array is
# allocated as uint8 (~7.7 GiB) instead of float64 (~61 GiB).
# Note: tf-idf values are floats below 1, so uint8 truncates them to 0;
# np.float32 halves memory while keeping the values.
df = pd.DataFrame(X.astype(np.uint8).toarray())
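
If a DataFrame is really needed, pandas can also wrap the sparse matrix without densifying it at all (a sketch, assuming pandas >= 0.25 for the sparse accessor and scikit-learn >= 1.0 for `get_feature_names_out`):

import pandas as pd

# Wraps the scipy sparse matrix in sparse-dtype columns,
# so the 61 GiB dense allocation never happens.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tfidf.get_feature_names_out())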

We can limit the amount of data being processed, for instance by windowing to the first 1000 records.

# Creating a document-term matrix
# I'll use the word matrix as a different view on the data
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word')
# Limit the data fed into the CountVectorizer because of limited RAM
data = cv.fit_transform(df_grouped['lemmatized_h'][:1000])

df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
# or, with a smaller dtype:
# df_dtm = pd.DataFrame(data.astype(np.uint8).toarray(), columns=cv.get_feature_names_out())
df_dtm.index = df_grouped.index[:1000]
df_dtm.head(3)
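
To cover the whole corpus rather than only the first 1000 rows, the same windowing idea can be applied chunk by chunk (a sketch; the chunk size and the `dtm.csv` output file are assumptions):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word')
data = cv.fit_transform(df_grouped['lemmatized_h'])  # fit once on the full corpus

chunk = 1000
for start in range(0, data.shape[0], chunk):
    block = data[start:start + chunk]  # slicing the sparse matrix is cheap
    df_block = pd.DataFrame(block.toarray(),  # only `chunk` rows are dense at a time
                            columns=cv.get_feature_names_out(),
                            index=df_grouped.index[start:start + chunk])
    df_block.to_csv('dtm.csv', mode='a', header=(start == 0))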