
I get a MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64:

from sklearn.feature_extraction.text import TfidfVectorizer

# remove_stopwords is a custom analyzer callable defined elsewhere
tfidf = TfidfVectorizer(analyzer=remove_stopwords)

X = tfidf.fit_transform(df['lemmatize'])
print(X.shape)

Output: (50000, 164921)

Now, here comes the memory error:

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names())

MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64

  • You can set parameters on the `TfidfVectorizer` to control memory use, such as `max_features`. – Peacepieceonepiece Sep 15 '21 at 08:29
  • `TfidfVectorizer` isn't pandas, it's [`sklearn.feature_extraction.text.TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Please fix your title and tags. Anyway, it has several settings to limit its memory use, `max_features` being the main one; also `min_df` and `max_df`. – smci Feb 25 '22 at 04:48
  • Also, you need to add the missing import that shows where `TfidfVectorizer` comes from. (Questions are required to contain an MCVE: [mcve]) – smci Feb 25 '22 at 04:58
  • a) **Don't convert the sparse array returned by sklearn into a pandas DataFrame or dense array; that will blow up memory usage: `pd.DataFrame(X.toarray()...)` is asking for trouble.** Duplicate of [Converting TfidfVectorizer sparse matrix to dataframe or dense array results in memory error](https://stackoverflow.com/questions/48886671/converting-tfidfvectorizer-sparse-matrix-to-array-results-in-memory-error). b) In any case, never run `TfidfVectorizer` without `max_features`. – smci Feb 25 '22 at 05:09
  • Tell us what downstream NLP tasks you're trying to perform on the `TfidfVectorizer` output X. What are your next 20 lines of code? Then, **figure out how to implement that natively on the sparse array**. – smci Feb 25 '22 at 05:11
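
Putting the comments above together, a minimal sketch of the vocabulary-limiting settings (the `df['lemmatize']` column comes from the question; the specific limit values here are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

# Cap the vocabulary so the matrix stays a manageable width:
# max_features keeps only the 20,000 highest-frequency terms,
# min_df drops terms appearing in fewer than 5 documents,
# max_df drops terms appearing in more than 50% of documents.
tfidf = TfidfVectorizer(max_features=20000, min_df=5, max_df=0.5)
X = tfidf.fit_transform(df['lemmatize'])  # X is a scipy.sparse matrix

# Most sklearn estimators accept the sparse matrix directly,
# e.g. LogisticRegression().fit(X, y), so X.toarray() is rarely needed.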

2 Answers


You are running low on memory, but you may be able to get it done by reducing the datatype from float64 to a smaller one such as uint8. Try this and see whether it throws the same error again.

import numpy as np
import pandas as pd

# Cast the sparse matrix before densifying, so the dense array is
# allocated as uint8 (~7.7 GiB) instead of float64 (~61 GiB).
# Note: tf-idf values are floats below 1, so uint8 truncates them to 0;
# np.float32 halves memory while keeping the values.
df = pd.DataFrame(X.astype(np.uint8).toarray())
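
If a DataFrame is really needed, pandas can also wrap the sparse matrix without densifying it at all (a sketch, assuming pandas >= 0.25 for the sparse accessor and scikit-learn >= 1.0 for `get_feature_names_out`):

import pandas as pd

# Wraps the scipy sparse matrix in sparse-dtype columns,
# so the 61 GiB dense allocation never happens.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tfidf.get_feature_names_out())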

We can limit the amount of data being processed, for instance by windowing to the first 1000 records.

# Creating a document-term matrix
# I'll use the word matrix as a different view on the data
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word')
# Limit the data fed into the CountVectorizer because of limited RAM
data = cv.fit_transform(df_grouped['lemmatized_h'][:1000])

df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
# or, with a smaller dtype:
# df_dtm = pd.DataFrame(data.astype(np.uint8).toarray(), columns=cv.get_feature_names_out())
df_dtm.index = df_grouped.index[:1000]
df_dtm.head(3)
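
To cover the whole corpus rather than only the first 1000 rows, the same windowing idea can be applied chunk by chunk (a sketch; the chunk size and the `dtm.csv` output file are assumptions):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word')
data = cv.fit_transform(df_grouped['lemmatized_h'])  # fit once on the full corpus

chunk = 1000
for start in range(0, data.shape[0], chunk):
    block = data[start:start + chunk]  # slicing the sparse matrix is cheap
    df_block = pd.DataFrame(block.toarray(),  # only `chunk` rows are dense at a time
                            columns=cv.get_feature_names_out(),
                            index=df_grouped.index[start:start + chunk])
    df_block.to_csv('dtm.csv', mode='a', header=(start == 0))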