To be specific, I am working with a dataset of 100,000 rows and 20 features. My target variable is categorical, so I use classifiers such as random forest, XGBoost, and logistic regression.

I have a binary feature 'A' that is 1 for only 20% of the rows in my dataframe, but all of my future data will come with 'A' == 1. When I train a random forest classifier, feature A does not rank as very important. If I split my train/test set randomly, the AUC on my test set is 0.8, but if I evaluate on only the subset of my test data where 'A' == 1, the AUC drops to 0.72.

Does anyone know what I should do in this situation? I don't think I should drop all the rows with 'A' == 0.
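A minimal sketch of the evaluation gap described above, using synthetic data (the real dataset isn't available, so the data-generating process, column layout, and all variable names here are assumptions): train a random forest, then compare the overall test AUC against the AUC computed on only the rows where the last column, the binary flag A, equals 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 19))
A = (rng.random(n) < 0.2).astype(int)  # binary flag, ~20% ones

# Synthetic target: depends on the other features and interacts with A,
# so the model behaves differently on the A == 1 subpopulation.
y = ((X[:, 0] + 0.5 * A * X[:, 1] + rng.normal(size=n)) > 0).astype(int)

features = np.column_stack([X, A])  # A is the last column
X_tr, X_te, y_tr, y_te = train_test_split(
    features, y, test_size=0.3, stratify=A, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

auc_all = roc_auc_score(y_te, proba)

mask = X_te[:, -1] == 1  # score only the rows where A == 1
auc_a1 = roc_auc_score(y_te[mask], proba[mask])
print(f"overall AUC: {auc_all:.3f}, A == 1 subset AUC: {auc_a1:.3f}")
```

Computing the subset AUC this way is the fair check here: if future data is all A == 1, the overall test AUC overstates deployed performance.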
- Do we need to keep feature A if you know that A is going to be all 1? How would you know that the future data will have only A == 1? If you insist on keeping A, then stratified sampling could work -- this way your splits will try to have as representative a population as possible. – user6461080 Jul 09 '20 at 18:54
- Hi @user6461080, my model is trying to do something like sales prediction. My original data does not have feature A, but I found that some of my data performed significantly differently from the rest. I confirmed with my colleagues, and they told me it might be because of an operational change during a specific time period, so I added a flag called feature A. – Tong Shao Jul 09 '20 at 20:59
- If you must use A, which still might not be the core issue, I recommend a stratified split of the training data and cross-validation. Only 20% of the rows have A == 1, so each training fold should carry the same proportion of them. If there is a high number of features, then one operational flag like A shouldn't drastically change the model's performance. I recommend reviewing your model methodology with a peer to see if there might be another issue. – user6461080 Jul 09 '20 at 23:39