I have 2.2 million data samples to classify into more than 7500 categories. I am using pandas and scikit-learn in Python to do so.
Below is a sample of my dataset:
```
itemid      description / category
11802974    SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters Architectural Diffusers
10688548    ANTIQUE BRONZE FINISH PUSHBUTTON switch Door Bell Pushbuttons
9836436     Descente pour Cable tray fitting and accessories Tray Cable Drop Outs
```
Below are the steps I have followed:
- Pre-processing
- Vector representation
- Training
```python
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)

# Keep letters only, drop digits, lowercase
dataset['description'] = dataset['description'].str.replace('[^a-zA-Z]', ' ')
dataset['description'] = dataset['description'].str.replace(r'[\d]', ' ')
dataset['description'] = dataset['description'].str.lower()

# Remove English stop words and collapse repeated whitespace
stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
dataset['description'] = dataset['description'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ')

# Tokenize, then lemmatize once per part-of-speech tag
dataset['description'] = dataset['description'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['description'] = dataset['description'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

# Bag-of-words representation
countvec = CountVectorizer(min_df=0.0005)
documenttermmatrix = countvec.fit_transform(dataset['description'])
column = countvec.get_feature_names()
y_train = dataset['category'].tolist()

# Delete everything that is no longer needed
del dataset
del stop
del tag
```
The `documenttermmatrix` generated is a SciPy CSR sparse matrix with about 12k features and 2.2 million samples.
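For reference, this is roughly how I check the footprint of that matrix (a minimal sketch; the byte counts below assume the CSR layout that `fit_transform` returns):

```python
from scipy import sparse

# Actual bytes held by the CSR matrix: non-zero values plus the two index arrays.
assert sparse.issparse(documenttermmatrix)
nbytes = (documenttermmatrix.data.nbytes
          + documenttermmatrix.indices.nbytes
          + documenttermmatrix.indptr.nbytes)
print(documenttermmatrix.shape)        # (2.2 million samples, ~12k features)
print(nbytes / 1024 ** 3, "GiB")       # well below the 128 GB of RAM on my machine
```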
For training I tried XGBoost via its scikit-learn API:
```python
model = XGBClassifier(silent=False, n_estimators=500, objective='multi:softmax', subsample=0.8)
model.fit(documenttermmatrix, y_train, verbose=True)
```
After 2-3 minutes of running the above code, I got this error:

```
OSError: [WinError 541541187] Windows Error 0x20474343
```
I also tried scikit-learn's Naive Bayes, which failed with a MemoryError.
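The Naive Bayes attempt was roughly the following (a sketch; I used `MultinomialNB` here, the exact estimator I tried may differ slightly):

```python
from sklearn.naive_bayes import MultinomialNB

# Roughly what I ran for the Naive Bayes attempt.
nb = MultinomialNB()
nb.fit(documenttermmatrix, y_train)   # this raised MemoryError on my machine
```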
Question
I am using a SciPy sparse matrix, which consumes very little memory, and I delete all unused objects before running XGBoost or Naive Bayes. My system has 128 GB of RAM, yet I still run into memory problems during training.

I am new to Python. Is there anything wrong with my code? Can anyone tell me how to use memory efficiently and proceed further?
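One direction I am considering, but have not verified on the full data, is incremental training over mini-batches of the sparse matrix with `partial_fit` (sketch below; the estimator and batch size are just placeholders I picked for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Sketch of out-of-core training over mini-batches of the CSR matrix.
y = np.asarray(y_train)
classes = np.unique(y)                 # all 7500+ labels must be declared up front
clf = SGDClassifier()                  # linear model that supports partial_fit
batch_size = 100_000                   # hypothetical batch size
for start in range(0, documenttermmatrix.shape[0], batch_size):
    end = start + batch_size
    clf.partial_fit(documenttermmatrix[start:end], y[start:end], classes=classes)
```

Would something along these lines be the right way to proceed, or is the problem elsewhere in my code?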