
I am using the code below to construct a document-term matrix in Python.

# Import the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# Load the data, replace punctuation and digits with spaces, and lowercase the text
dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1")
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[^\w\s]', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[\d]', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()

# Remove English stop words
stop = set(stopwords.words('english'))
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ', regex=True)

# Build the sparse count matrix, then convert it to a dense term-by-document DataFrame
vectorizer = CountVectorizer()
x1 = vectorizer.fit_transform(dataset['ProductDescription'].values.astype('U'))
df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names())

For a dataset of 10,000 records the code works fine, but with a large dataset of around 1,100,000 records I get a memory error when I execute

  df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names())

Can somebody please tell me where I have gone wrong?
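
A rough way to see why that one line fails is to estimate the size of the dense array that x1.toarray() has to allocate. The sketch below is only a back-of-the-envelope calculation: the vocabulary size of 50,000 terms is an assumed figure for illustration (the real number would be len(vectorizer.vocabulary_) after fitting), and the 8 bytes per cell corresponds to the int64 dtype that CountVectorizer uses by default.

```python
# Back-of-the-envelope size of the dense array produced by x1.toarray().
# ASSUMPTION: a vocabulary of 50,000 terms is a made-up figure for illustration;
# the real value is len(vectorizer.vocabulary_) after fitting.

def dense_size_bytes(n_documents, n_terms, bytes_per_value=8):
    """Bytes needed to store every cell, zeros included (int64 = 8 bytes)."""
    return n_documents * n_terms * bytes_per_value

small = dense_size_bytes(10_000, 50_000)
large = dense_size_bytes(1_100_000, 50_000)

print(f"10,000 documents:    {small / 1e9:.0f} GB")   # prints 4 GB
print(f"1,100,000 documents: {large / 1e9:.0f} GB")   # prints 440 GB
```

Under these assumptions the dense array for the large dataset needs hundreds of gigabytes, while the sparse matrix x1 stores only the non-zero counts. A smaller dataset would normally also have a smaller vocabulary, so the real gap between the two cases is likely even wider than shown.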

Ranjana Girish
  • Where are you using this last line, and why? Why are you converting a sparse array to a dense one? That's the source of the error. – Vivek Kumar Oct 13 '17 at 06:00
  • I need to give a dense matrix to random forest. I didn't find any implementation of random forest that accepts a sparse matrix. – Ranjana Girish Oct 13 '17 at 06:08
  • Moreover, after x1 = vectorizer.fit_transform(dataset['ProductDescription'].values.astype('U')) completes, I get a memory error for anything I do. – Ranjana Girish Oct 13 '17 at 06:21
  • First, if you are using RandomForestClassifier, it accepts a sparse matrix. Second, try to use a more powerful machine if you can. – Vivek Kumar Oct 13 '17 at 07:59
  • @Vivek Kumar, thanks for your response. I am using a system with 128 GB of RAM. After the sparse matrix is created, I get a memory error and am not able to do anything. – Ranjana Girish Oct 13 '17 at 13:16
  • I don't understand. Can you explain a bit more? As a side note, please don't load the whole data into the DataFrame if it takes too much space. You can directly supply the filename as input to CountVectorizer. – Vivek Kumar Oct 13 '17 at 13:43
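
The point made in the comments, that RandomForestClassifier accepts a sparse matrix, can be sketched as follows. This is a minimal illustration, not the asker's pipeline: the four descriptions and their labels are hypothetical stand-ins for dataset['ProductDescription'] and a target column that the question does not show, and stop_words="english" uses scikit-learn's built-in list rather than NLTK's.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for dataset['ProductDescription'] and its labels.
descriptions = [
    "red cotton shirt",
    "blue cotton shirt",
    "steel kitchen knife",
    "ceramic kitchen knife",
]
labels = ["clothing", "clothing", "kitchen", "kitchen"]

# Let the vectorizer drop stop words itself (scikit-learn's list, not NLTK's).
vectorizer = CountVectorizer(stop_words="english")
x1 = vectorizer.fit_transform(descriptions)  # SciPy sparse matrix, documents x terms

# Fit directly on the sparse matrix; no .toarray() call is needed.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(x1, labels)

# New text must go through the same fitted vectorizer before prediction.
x_new = vectorizer.transform(["green cotton shirt"])
prediction = model.predict(x_new)
print(prediction)
```

Because fit_transform accepts any iterable of strings, the descriptions could also be yielded from the CSV piece by piece instead of being held in a DataFrame, which is in the spirit of the side note above. Whether that is enough for 1,100,000 records depends on the vocabulary size and the forest settings, which the question does not state.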
