
In the following code on Google Colab, when it reaches the toarray method it uses all the RAM. Searching for an answer, I found suggestions to use HashingVectorizer. How can I implement it in the following code?

The shape of cv.fit_transform(data_list) is (324430, 351550), so the dense array produced by toarray would be on the order of 900 GB at 8 bytes per element.

import re
import pandas as pd

# Loading the dataset
data = pd.read_csv("Language Detection.csv")
# value count for each language
data["Language"].value_counts()
# separating the independent and dependent features
X = data["Text"]
y = data["Language"]
# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols and numbers
    text = re.sub(r'[!@#$(),n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
# creating bag of words using countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
#train test splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
#model creation and prediction
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
Asdoost

1 Answer


Just don't use toarray. The output of CountVectorizer is a sparse matrix, which MultinomialNB handles fine.
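For example, a minimal sketch of that change, keeping the names from the question's code (the splitting and model setup are otherwise unchanged):

# keep the sparse matrix returned by fit_transform instead of densifying it
X = cv.fit_transform(data_list)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
model = MultinomialNB()
# MultinomialNB accepts scipy sparse input directly
model.fit(x_train, y_train)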

If you really want to use hashing, you should just be able to replace CountVectorizer with HashingVectorizer.
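A minimal sketch of that swap, assuming you also pass alternate_sign=False (HashingVectorizer's default signed hashing produces negative values, which MultinomialNB rejects); the n_features value here is only illustrative:

from sklearn.feature_extraction.text import HashingVectorizer

# stateless hashing: no vocabulary is stored, so memory use stays bounded
# alternate_sign=False keeps all feature values non-negative for MultinomialNB
hv = HashingVectorizer(alternate_sign=False, n_features=2**20)
X = hv.transform(data_list)  # no fit needed; returns a sparse matrix
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
model = MultinomialNB()
model.fit(x_train, y_train)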

Ben Reiniger
  • I saw this answer in other threads and followed it. When I ran `model.fit(x_train, y_train)` it executed very fast and returned `MultinomialNB()`, and I thought "That was fast! Something must be wrong," so I didn't run the rest of the code. Now that you suggest the same thing, I ran it again and got the same result, but this time I executed the rest of the code. It turns out this is the answer; the fast execution of `model.fit(x_train, y_train)` had fooled me. Thank you. – Asdoost Jul 04 '22 at 19:42