
I'm working on a dataset of around 20,000 rows. The aim is to predict whether or not a person has been hired by a company, given features such as gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are '1' or '0' (hired / not hired) with a ratio of 1:10.

I chose to train a Random Forest Classifier on this problem. I split the dataset randomly into a 70% training set and a 30% test set.

After carefully reading about the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn), I got stuck on getting a good score on my test set.

I tried several things:

  • I trained three different random forests: one on the whole X_train, one on an undersampled training set X_und, and one on an oversampled set X_sm. X_und was generated by randomly dropping rows of X_train labelled 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
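
For clarity, this is roughly how the two resampled sets are built (a minimal sketch assuming pandas objects X_train/y_train and the imbalanced-learn package; only the 50-50 undersampling case is shown):

import numpy as np
from imblearn.over_sampling import SMOTE

# undersampling: randomly keep as many 0s as there are 1s (50-50);
# for 66-33 or 75-25, keep 2x or 3x the number of 1s instead
idx_0 = np.where(y_train == 0)[0]
idx_1 = np.where(y_train == 1)[0]
keep_0 = np.random.choice(idx_0, size=len(idx_1), replace=False)
keep = np.concatenate([keep_0, idx_1])
X_und, y_und = X_train.iloc[keep], y_train.iloc[keep]

# oversampling: SMOTE synthesizes new 1s until the classes are balanced
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_train, y_train)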

Using scikit-learn's GridSearchCV, I tuned the three models to get the best parameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

param_grid = {'min_samples_leaf': [3, 5, 7, 10, 15],
              'max_features': [0.5, 'sqrt', 'log2'],
              'max_depth': [10, 15, 20],
              'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 5}, 'balanced'],
              'criterion': ['entropy', 'gini']}

# stratified splits keep the 1:10 class ratio in every CV fold
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss,
                    verbose=1, n_jobs=-1, scoring='roc_auc')
grid.fit(X_train, y_train)
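
For reference, the tuned settings and the corresponding cross-validated score can then be read off the fitted grid (standard GridSearchCV attributes):

print(grid.best_params_)        # parameter combination with the best mean CV roc_auc
print(grid.best_score_)         # its mean cross-validated score
best_rf = grid.best_estimator_  # already refit on the whole X_train (refit=True by default)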

The best score was obtained with

rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
                             max_features=0.5, n_jobs=-1, oob_score=True,
                             class_weight={0: 1, 1: 5})

trained on the whole X_train, which gives the following classification report on the test set:

              precision    recall  f1-score   support

           0     0.9397    0.9759    0.9575      5189
           1     0.7329    0.5135    0.6039       668

   micro avg     0.9232    0.9232    0.9232      5857
   macro avg     0.8363    0.7447    0.7807      5857
weighted avg     0.9161    0.9232    0.9171      5857

With the sampling methods I got similar results, but no better ones: precision for the 1s went down with undersampling, and I got almost the same result with oversampling.

For undersampling:

              precision    recall  f1-score   support

           0     0.9532    0.9310    0.9420      5189
           1     0.5463    0.6452    0.5916       668

For SMOTE:

              precision    recall  f1-score   support

           0     0.9351    0.9794    0.9567      5189
           1     0.7464    0.4716    0.5780       668

  • I played with the class_weight parameter to give more weight to the 1s, and also with sample_weight in the fitting process (see the sketch after this list).
  • I tried to figure out which score to take into account other than accuracy. When running GridSearchCV to tune the forests, I used different scorings, focusing especially on f1 and roc_auc, hoping to decrease the false negatives. I got great scores with the SMOTE oversampling, but that model did not generalize well to the test set. I wasn't able to understand how to change the splitting criterion or the scoring of the random forest in order to lower the number of false negatives and increase the recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it seems it cannot be used directly as a scoring option in sklearn's cross-validation methods like GridSearchCV (a hand-written scorer is sketched after this list).
  • I selected only the most important features, but this did not change the result; on the contrary, it got worse. I also noticed that the feature importances obtained from a RF trained after SMOTE were completely different from those obtained on the original sample (a comparison is sketched after this list).
  • I don't know exactly what to do with the oob_score, other than treating it as a free validation score obtained while training the forests. With the oversampling I get the highest oob_score = 0.9535, but this is natural since the training set is balanced in that case; the problem is still that the model does not generalize well to the test set.
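
To make the weighting point concrete, this is the kind of thing I mean by sample_weight in the fitting process (a minimal sketch; the exact weights I tried varied, and compute_sample_weight with 'balanced' is just one way to derive them):

from sklearn.utils.class_weight import compute_sample_weight

# per-row weights derived from the class frequencies ('balanced' up-weights the 1s roughly 10x here)
weights = compute_sample_weight(class_weight='balanced', y=y_train)
rfc.fit(X_train, y_train, sample_weight=weights)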
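
On cohen_kappa_score: it is not one of the built-in scoring strings, but a hand-written scorer can be passed to GridSearchCV via make_scorer (a sketch only; I have not checked whether it actually helps here):

from sklearn.metrics import cohen_kappa_score, make_scorer

kappa_scorer = make_scorer(cohen_kappa_score)
grid_kappa = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss,
                          scoring=kappa_scorer, n_jobs=-1, verbose=1)
grid_kappa.fit(X_train, y_train)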
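
And this is the feature-importance comparison mentioned above (a sketch; rf_plain and rf_smote are placeholder names for a forest fitted on X_train and one fitted on the SMOTE data):

import pandas as pd

importances = pd.DataFrame({'plain': rf_plain.feature_importances_,
                            'smote': rf_smote.feature_importances_},
                           index=X_train.columns).sort_values('plain', ascending=False)
print(importances.head(10))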

Right now I have run out of ideas, so I would like to know if I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?

  • Try other algorithms like `SVM`, `XGB`, `LGB` or `logistic regression`, etc., and compare the results. – Sociopath Dec 20 '18 at 12:35
  • What is your imbalance level? – user2974951 Dec 20 '18 at 12:43
  • @user2974951 the imbalance level is 1:10 of 1s vs 0s – bruco Dec 20 '18 at 13:55
  • Show us the results of all the tried methods (confusion matrix). – user2974951 Dec 20 '18 at 13:58
  • @Sociopath my question was more about how I can improve the score with RF. I've tried logistic regression and SVM, and I get bad results: they behave like trivial all 0s classifiers. – bruco Dec 20 '18 at 13:58
  • @user2974951 confusion matrix of rfc: `[[5064 125] [ 325 343]]`. For the oversampling `[[5082 107] [ 353 315]]` and for the undersampling `[[4831 358] [ 237 431]]` – bruco Dec 20 '18 at 14:05
  • Please use percentages (or ratios) and put the results in your question so that we can better see. – user2974951 Dec 20 '18 at 14:06
  • @user2974951 added the classification reports also for undersampling and oversampling – bruco Dec 20 '18 at 14:16
  • The situation is not hopeless, your models are learning something, although it's not great. More tuning is probably required. As already mentioned, try some other models first, if they don't improve much then you are going to have to get knee deep in model tuning. Also, why do you have two F1-scores for each confusion matrix? You should have only 1. – user2974951 Dec 20 '18 at 14:26
  • @user2974951 Notice that the tables I printed are generated by `classification_report(y_true, y_pred)`. I have two F1-scores, calculated from the precision and recall of the predictions for label 0 and for label 1. For label 0 these quantities have other names, like negative predictive value and specificity. – bruco Dec 20 '18 at 14:44
  • 'I saw that cohen_kappa_score is also useful for imbalanced datasets, but it cannot be used in sklearn's cross-validation methods like GridSearchCV.' CKS is good, but for very imbalanced datasets. It can be used, but the scoring function has to be written by hand. – avchauzov Dec 21 '18 at 03:07
  • Please attach the code for GridSearchCV, and also the parameters you are using for the grid. – avchauzov Dec 21 '18 at 03:12
  • @avchauzov here you are – bruco Dec 21 '18 at 10:30
  • @bruco I advise you first to change your param_grid to: param_grid = {'min_samples_leaf': [1,2,3,5,7,10,15], 'max_features': [0.5,'sqrt','log2', None], 'max_depth': [10,15,20, None], 'class_weight': [{0:1,1:1},{0:1,1:2},{0:1,1:5},'balanced', None], 'criterion': ['entropy','gini'], 'oob_score': [True, False]}. I added None values for some parameters, 1 and 2 for min_samples_leaf, and oob_score. As I remember, the class_weight value 'balanced' does not work well when you over- or under-sample. You can also add some more values to max_depth. – avchauzov Dec 22 '18 at 08:10
  • I.e. the parameter grid is pretty coarse, so you do not see a difference. – avchauzov Dec 22 '18 at 08:11
  • Also, I must say, the n_splits parameter may have a big influence in some cases. – avchauzov Dec 22 '18 at 08:54

0 Answers