I have 2.2 million data samples to classify into more than 7500 categories. I am using pandas and scikit-learn in Python to do so.
Below is a sample of my dataset:
```
itemid      description / category
11802974    SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters Architectural Diffusers
10688548    ANTIQUE BRONZE FINISH PUSHBUTTON switch Door Bell Pushbuttons
9836436     Descente pour Cable tray fitting and accessories Tray Cable Drop Outs
```
Below are the steps I have followed:
- Pre-processing
- Vector representation
- Training
```python
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)

# Keep letters only, drop digits, lowercase
dataset['description'] = dataset['description'].str.replace('[^a-zA-Z]', ' ')
dataset['description'] = dataset['description'].str.replace(r'[\d]', ' ')
dataset['description'] = dataset['description'].str.lower()

# Remove English stop words and collapse repeated whitespace
stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
dataset['description'] = dataset['description'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ')

# Tokenize, then lemmatize once per part-of-speech tag
dataset['description'] = dataset['description'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['description'] = dataset['description'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

# Bag-of-words representation
countvec = CountVectorizer(min_df=0.0005)
documenttermmatrix = countvec.fit_transform(dataset['description'])
column = countvec.get_feature_names()
y_train = dataset['category'].tolist()

# Delete everything that is no longer needed
del dataset
del stop
del tag
```
The `documenttermmatrix` generated is a SciPy CSR sparse matrix with about 12k features and 2.2 million samples.
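For reference, this is roughly how I check the footprint of that matrix (a minimal sketch; the byte counts below assume the CSR layout that `fit_transform` returns):

```python
from scipy import sparse

# Actual bytes held by the CSR matrix: non-zero values plus the two index arrays.
assert sparse.issparse(documenttermmatrix)
nbytes = (documenttermmatrix.data.nbytes
          + documenttermmatrix.indices.nbytes
          + documenttermmatrix.indptr.nbytes)
print(documenttermmatrix.shape)        # (2.2 million samples, ~12k features)
print(nbytes / 1024 ** 3, "GiB")       # well below the 128 GB of RAM on my machine
```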
For training I tried XGBoost via its scikit-learn API:
```python
model = XGBClassifier(silent=False, n_estimators=500, objective='multi:softmax', subsample=0.8)
model.fit(documenttermmatrix, y_train, verbose=True)
```
After 2-3 minutes of running the above code, I got this error:

```
OSError: [WinError 541541187] Windows Error 0x20474343
```
I also tried scikit-learn's Naive Bayes, which failed with a MemoryError.
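The Naive Bayes attempt was roughly the following (a sketch; I used `MultinomialNB` here, the exact estimator I tried may differ slightly):

```python
from sklearn.naive_bayes import MultinomialNB

# Roughly what I ran for the Naive Bayes attempt.
nb = MultinomialNB()
nb.fit(documenttermmatrix, y_train)   # this raised MemoryError on my machine
```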
Question
I am using a SciPy sparse matrix, which consumes very little memory, and I delete all unused objects before running XGBoost or Naive Bayes. My system has 128 GB of RAM, yet I still run into memory problems during training.

I am new to Python. Is there anything wrong with my code? Can anyone tell me how to use memory efficiently and proceed further?
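One direction I am considering, but have not verified on the full data, is incremental training over mini-batches of the sparse matrix with `partial_fit` (sketch below; the estimator and batch size are just placeholders I picked for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Sketch of out-of-core training over mini-batches of the CSR matrix.
y = np.asarray(y_train)
classes = np.unique(y)                 # all 7500+ labels must be declared up front
clf = SGDClassifier()                  # linear model that supports partial_fit
batch_size = 100_000                   # hypothetical batch size
for start in range(0, documenttermmatrix.shape[0], batch_size):
    end = start + batch_size
    clf.partial_fit(documenttermmatrix[start:end], y[start:end], classes=classes)
```

Would something along these lines be the right way to proceed, or is the problem elsewhere in my code?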