I'm working on a dataset of around 20,000 rows. The aim is to predict whether a person has been hired by a company or not, given features such as gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are '1' or '0' (hired / not hired) with a ratio of 1:10.
I chose to train a Random Forest classifier on this problem. I split the dataset 70%-30% randomly into a training set and a test set.
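The split itself is nothing special; roughly like this (a sketch, assuming the features are in X, the labels in y, and an arbitrary random_state):

from sklearn.model_selection import train_test_split

# plain random 70-30 split, no stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)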
After carefully reading the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn), I got stuck on getting a good score on my test set.
I tried several things:
- I trained three different random forests: on the whole X_train, on the undersampled training set X_und, and on the oversampled set X_sm, respectively. X_und was generated by simply dropping at random rows of X_train labelled 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
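Roughly, the resampled training sets were built like this (a sketch using imbalanced-learn; the sampling_strategy value is just one example of the ratios mentioned above):

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# random undersampling of the 0s; sampling_strategy=0.5 gives a 66-33 ratio of 0s to 1s
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_und, y_und = rus.fit_resample(X_train, y_train)

# SMOTE oversampling of the 1s up to a 50-50 ratio
sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(X_train, y_train)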
Using scikit-learn's GridSearchCV, I tuned the three models to find the best parameters:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

param_grid = {'min_samples_leaf': [3, 5, 7, 10, 15],
              'max_features': [0.5, 'sqrt', 'log2'],
              'max_depth': [10, 15, 20],
              'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 5}, 'balanced'],
              'criterion': ['entropy', 'gini']}

# stratified splits so each fold keeps the 1:10 class ratio
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss,
                    verbose=1, n_jobs=-1, scoring='roc_auc')
grid.fit(X_train, y_train)
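For reference, the best combination is then read off the fitted grid:

print(grid.best_params_)       # winning hyper-parameter combination
print(grid.best_score_)        # its mean cross-validated roc_auc
best_rf = grid.best_estimator_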
The best score was obtained with
rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
                             max_features=0.5, n_jobs=-1, oob_score=True,
                             class_weight={0: 1, 1: 5})
trained on the whole X_train, which gave the following classification report on the test set:
              precision    recall  f1-score   support

           0     0.9397    0.9759    0.9575      5189
           1     0.7329    0.5135    0.6039       668

   micro avg     0.9232    0.9232    0.9232      5857
   macro avg     0.8363    0.7447    0.7807      5857
weighted avg     0.9161    0.9232    0.9171      5857
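For reference, the report above comes from something like this (a sketch, assuming the rfc defined above and the 70-30 split):

from sklearn.metrics import classification_report

rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)   # default 0.5 probability threshold
print(classification_report(y_test, y_pred, digits=4))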
With the sampling methods I got similar results, but none better: precision for the 1s went down with undersampling, and with SMOTE oversampling I got almost the same result as on the whole training set.
For undersampling:
              precision    recall  f1-score   support

           0     0.9532    0.9310    0.9420      5189
           1     0.5463    0.6452    0.5916       668
For SMOTE:
              precision    recall  f1-score   support

           0     0.9351    0.9794    0.9567      5189
           1     0.7464    0.4716    0.5780       668
- I played with the parameter class_weight to give more weight to the 1s, and also with sample_weight in the fitting process.
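For the sample_weight part, I did something along these lines (a sketch; compute_sample_weight is just one way to build the weights):

from sklearn.utils.class_weight import compute_sample_weight

# weight each row inversely to its class frequency, so the 1s weigh roughly 10x more
weights = compute_sample_weight(class_weight='balanced', y=y_train)
rfc.fit(X_train, y_train, sample_weight=weights)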
- I tried to figure out which score to take into account other than accuracy. Running the GridSearchCV to tune the forests, I used different scorings, focusing especially on f1 and roc_auc, hoping to reduce the number of false negatives. I got great cross-validation scores with the SMOTE oversampling, but this model did not generalize well to the test set. I wasn't able to understand how to change the splitting criterion or the scoring of the random forest in order to lower the number of false negatives and increase the recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it is not among the built-in scorings of sklearn's cross-validation methods like GridSearchCV.
- I selected only the most important features, but this did not change the result; on the contrary, it got worse. I also noticed that the feature importances obtained from a forest trained after SMOTE were completely different from those of the forest trained on the original sample.
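The comparison of the feature importances was as simple as this (a sketch, assuming X_train is a pandas DataFrame and rf_orig / rf_smote are hypothetical names for the two fitted forests):

import pandas as pd

# rf_orig fitted on X_train, rf_smote fitted on the SMOTE-resampled X_sm
imp = pd.DataFrame({'original': rf_orig.feature_importances_,
                    'smote': rf_smote.feature_importances_},
                   index=X_train.columns)
print(imp.sort_values('original', ascending=False))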
- I don't know exactly what to do with the oob_score, other than treating it as a free validation score obtained while training the forests. With the oversampling I get the highest value, oob_score = 0.9535, but this is kind of natural since the training set is balanced in that case; the problem remains that the model does not generalize well to the test set.
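For completeness, this is all I currently do with it (a sketch on the oversampled data):

rf = RandomForestClassifier(n_estimators=150, oob_score=True, n_jobs=-1)
rf.fit(X_sm, y_sm)
print(rf.oob_score_)   # OOB accuracy, computed on the (balanced) resampled training data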
Right now I've run out of ideas, so I would like to know whether I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?