
I'm very new to ML and I'm trying to build a classifier for an imbalanced binary classification problem from real life. I've tried various models such as Logistic Regression, Random Forest, and ANNs, but every time I get very high precision and recall (around 94%) on the train data and very poor results (around 1%) on the test or validation data. I have 53 features and 97,094 data points. I tried tweaking hyper-parameters, but as far as I understand, with the current precision and recall on the test and validation data, that will not help significantly. Can anyone please help me understand what could have gone wrong? Thank you.

from sklearn.ensemble import RandomForestClassifier

# Shallow, weighted forest: class 1 (the minority) gets 80% of the weight.
rf = RandomForestClassifier(bootstrap=True, class_weight={1: 0.80, 0: 0.20},
                            criterion='entropy', max_depth=2, max_features=4,
                            min_impurity_decrease=0.01,
                            min_weight_fraction_leaf=0.0, n_estimators=10,
                            n_jobs=-1, oob_score=False, random_state=41,
                            verbose=0, warm_start=False)
rf.fit(X_train, y_train)
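
This is roughly how I evaluate (a sketch; X_test and y_test hold the data for the day I'm validating on):

from sklearn.metrics import classification_report

# Precision/recall on the training data (~94%) vs. the held-out day (~1%).
print(classification_report(y_train, rf.predict(X_train)))
print(classification_report(y_test, rf.predict(X_test)))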

Result: [figures omitted: selected features' correlation with each other; all features' correlation with the target]

vishnu priya
  • I've added scatter plots of the train and test class distributions, which might be helpful in understanding the problem – vishnu priya Mar 10 '20 at 21:55
  • What do these scatter plots represent? The x- and y-axis labels are missing. It would be very helpful to see actual code from you, i.e. how you load the data, what your preprocessing steps are, and how you train your model. Just make a simple example with a Random Forest. It will make it much easier to help you further. – mrzo Mar 10 '20 at 22:54
  • It is not possible for me to share the preprocessing steps. The blue correlation chart is against the target variable, and the heat map is for the selected features that I'm using in the Random Forest. To balance the classes, I've randomly sampled class 0 data points – vishnu priya Mar 11 '20 at 11:01
  • The results you are giving: are the first ones from testing and the second ones from training? And why are you using only 149 samples for testing if you actually have over 80,000? – mrzo Mar 12 '20 at 00:10
  • Yes, the first one is the test result and the second is the train result. I need to predict at day level, which is why I'm training on the historical data and testing on a single day. – vishnu priya Mar 12 '20 at 03:25
  • In general, have you tried using more testing data (i.e. more days)? It could be that this is a very special (outlier) day, for example. In other words: did you check whether the testing day is similar to the rest of your training data? – mrzo Mar 12 '20 at 04:34

2 Answers


It is difficult to say without seeing your actual data or the code you are using, but your models are probably overfitting to your training dataset or to your majority class.

If your model overfits your training dataset, it effectively memorises the training data: instead of finding general patterns that separate the classes, it fits its classification boundaries very closely to the individual training points. You should consider using less complex models (e.g. limit the depth and number of trees in your Random Forest), dropping some features (e.g. start with only 3 of your 53 features, as sketched below), regularisation, or data augmentation. See here for more techniques against training data overfitting and here for an example of over- and underfitting.
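
A minimal sketch of both ideas, assuming X_train and y_train as in your question; all parameter values here are illustrative, not tuned:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Keep only the 3 most informative of the 53 features.
selector = SelectKBest(f_classif, k=3)
X_train_small = selector.fit_transform(X_train, y_train)

# A deliberately constrained forest: shallow trees and a minimum leaf
# size keep the model from memorising individual training points.
rf_small = RandomForestClassifier(n_estimators=50, max_depth=3,
                                  min_samples_leaf=50, random_state=41)
rf_small.fit(X_train_small, y_train)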

If your model simply overfits to your majority class (e.g. 99% of your data has the same class), you could try oversampling the minority class during training.
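
For example, with the imbalanced-learn package (note: this is a separate package, not part of scikit-learn), random oversampling is a two-liner:

from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows until both classes are equally frequent,
# then train on the rebalanced data.
ros = RandomOverSampler(random_state=41)
X_res, y_res = ros.fit_resample(X_train, y_train)
rf.fit(X_res, y_res)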

mrzo
  • I'm working on the points you mentioned. Can you please have a look at the scatter plots and let me know if there is anything else I should look into? I hope the images provide some extra information about the problem. – vishnu priya Mar 10 '20 at 22:06

Good training performance combined with poor test performance tells me that you're likely overfitting the model: it cannot generalize well enough and should be simplified. As @mrzo said, you have far too many features, so look into dimensionality reduction algorithms and apply them to your dataset before applying your model. Another good place to start is tree classifiers' "feature importance" attributes, to see what actually matters in a given dataset (see the sketch below). Without looking at your model and dataset, it is just speculation, though.
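
As an illustration, here is a sketch of both suggestions, reusing the fitted rf and X_train from the question; the component and feature counts are arbitrary choices, not recommendations:

import numpy as np
from sklearn.decomposition import PCA

# 1) Feature importance: rank the 53 features by the fitted forest's
#    impurity-based importance scores.
order = np.argsort(rf.feature_importances_)[::-1]
print("Top 10 features by importance:", order[:10])

# 2) Dimensionality reduction: project the data onto its first few
#    principal components before fitting any model on it.
pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)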

Vaidøtas I.