To be specific, I am working with a dataset of 100,000 rows and 20 features. My target variable is categorical, so I use classifiers such as random forest, XGBoost, and logistic regression.

I have a binary feature 'A' that is 1 for only 20% of the rows in my dataframe, but all of my future data will come with 'A' == 1. When I train a random forest classifier, feature A does not rank as very important. If I split my train/test set randomly, the AUC on my test set is 0.8, but if I evaluate on only the subset of my test data where 'A' == 1, the AUC drops to 0.72.

Does anyone know what I should do in this situation? I don't think I should drop all the rows with 'A' == 0.
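A minimal sketch of the evaluation gap described above, using synthetic data (the real dataset isn't available, so the data-generating process, column layout, and all variable names here are assumptions): train a random forest, then compare the overall test AUC against the AUC computed on only the rows where the last column, the binary flag A, equals 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 19))
A = (rng.random(n) < 0.2).astype(int)  # binary flag, ~20% ones

# Synthetic target: depends on the other features and interacts with A,
# so the model behaves differently on the A == 1 subpopulation.
y = ((X[:, 0] + 0.5 * A * X[:, 1] + rng.normal(size=n)) > 0).astype(int)

features = np.column_stack([X, A])  # A is the last column
X_tr, X_te, y_tr, y_te = train_test_split(
    features, y, test_size=0.3, stratify=A, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

auc_all = roc_auc_score(y_te, proba)

mask = X_te[:, -1] == 1  # score only the rows where A == 1
auc_a1 = roc_auc_score(y_te[mask], proba[mask])
print(f"overall AUC: {auc_all:.3f}, A == 1 subset AUC: {auc_a1:.3f}")
```

Computing the subset AUC this way is the fair check here: if future data is all A == 1, the overall test AUC overstates deployed performance.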
- Do we need to keep feature A if you know that A is going to be all 1? How would you know that the future data will have only A == 1? If you insist on keeping A, then stratified sampling could work -- this way your splits will try to have as representative a population as possible. – user6461080 Jul 09 '20 at 18:54
- Hi @user6461080, my model is trying to do something like sales prediction. My original data does not have feature A, but I found that some of my data performed significantly differently from the rest. I confirmed with my colleagues, and they told me it might be because of an operational change during a specific time period, so I added a flag called feature A. – Tong Shao Jul 09 '20 at 20:59
- If you must use A, which still might not be the core issue, I recommend a stratified split of the training data and cross-validation. Only 20% of the rows have A == 1, so each training fold should carry the same proportion of them. If there is a high number of features, then one operational flag like A shouldn't drastically change the model's performance. I recommend reviewing your model methodology with a peer to see if there might be another issue. – user6461080 Jul 09 '20 at 23:39