
In the following code on Google Colab, when it reaches the toarray method it uses all the RAM. Searching for an answer, I found suggestions to use HashingVectorizer. How can I implement it in the following code?

The shape of cv.fit_transform(data_list) is (324430, 351550), so the dense array produced by toarray would be on the order of 900 GB at 8 bytes per element.

import re
import pandas as pd

# Loading the dataset
data = pd.read_csv("Language Detection.csv")
# value count for each language
data["Language"].value_counts()
# separating the independent and dependent features
X = data["Text"]
y = data["Language"]
# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols and numbers
    text = re.sub(r'[!@#$(),n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
# creating bag of words using countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
#train test splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
#model creation and prediction
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
Asdoost

1 Answer


Just don't use toarray. The output of CountVectorizer is a sparse matrix, which MultinomialNB handles fine.
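For example, a minimal sketch of that change, keeping the names from the question's code (the splitting and model setup are otherwise unchanged):

# keep the sparse matrix returned by fit_transform instead of densifying it
X = cv.fit_transform(data_list)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
model = MultinomialNB()
# MultinomialNB accepts scipy sparse input directly
model.fit(x_train, y_train)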

If you really want to use hashing, you should just be able to replace CountVectorizer with HashingVectorizer.
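A minimal sketch of that swap, assuming you also pass alternate_sign=False (HashingVectorizer's default signed hashing produces negative values, which MultinomialNB rejects); the n_features value here is only illustrative:

from sklearn.feature_extraction.text import HashingVectorizer

# stateless hashing: no vocabulary is stored, so memory use stays bounded
# alternate_sign=False keeps all feature values non-negative for MultinomialNB
hv = HashingVectorizer(alternate_sign=False, n_features=2**20)
X = hv.transform(data_list)  # no fit needed; returns a sparse matrix
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
model = MultinomialNB()
model.fit(x_train, y_train)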

Ben Reiniger
  • I saw this answer in other threads and followed it. When I ran `model.fit(x_train, y_train)` it executed very fast and returned `MultinomialNB()`, and I thought "That was fast! Something must be wrong," so I didn't run the rest of the code. Now that you suggest the same thing, I ran it again and got the same result, but this time I executed the rest of the code. It turns out this is the answer; the fast execution of `model.fit(x_train, y_train)` had fooled me. Thank you. – Asdoost Jul 04 '22 at 19:42