Training data has columns with all values missing, but the same columns in the test data have some values. How should this be handled?

Question

I have been given separate training and test datasets. Both have exactly the same structure (same columns/features). Some columns in the training dataset have missing values in every row. If I wanted to build a predictive model, I could simply delete those columns, since they carry no information at all. The trouble is that those same columns have some values in the test dataset. If I remove those columns from the training dataset, I will have to remove them from the test dataset too. I can do that, but the number of such columns is quite large (about 150 out of 250 columns in total), so I’m very hesitant to remove them. Any idea or solution for preserving those columns would be really helpful. Thanks!

Short answer: yes, you can remove them. Since those columns offer nothing to learn from, how could they be used for inference? Please post this on https://stats.stackexchange.com, as it is off-topic here. — Vivek Kumar, Nov 16 '17 at 09:26

score 0 · Answer 1 · answered Nov 16 '17 at 14:10


If your train/test data are split appropriately, then a column that is useless in one set is useless in the other.

Alternatively, you can try to interpolate (impute) the missing data.
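Both suggestions can be sketched with pandas. This is only an illustration on hypothetical toy data, not the asker's dataset: the column names `a`, `b`, `y` and the choice of median imputation are assumptions. It drops columns that are entirely missing in the training set from both sets, then fills the remaining gaps with statistics computed from the training set only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is entirely missing in train
# but partly filled in test; "y" is the (assumed) target column.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                      "b": [np.nan, np.nan, np.nan],
                      "y": [0, 1, 0]})
test = pd.DataFrame({"a": [4.0, np.nan], "b": [7.0, np.nan]})

# Columns with no observed value in the training set give a model nothing to fit on.
empty_cols = train.columns[train.isna().all()].tolist()

# Drop them from both sets so the features match at fit and predict time.
train_kept = train.drop(columns=empty_cols)
test_kept = test.drop(columns=empty_cols, errors="ignore")

# Fill the remaining gaps with medians learned from the training set only.
feature_cols = [c for c in train_kept.columns if c != "y"]
train_medians = train_kept[feature_cols].median()
train_kept[feature_cols] = train_kept[feature_cols].fillna(train_medians)
test_kept[feature_cols] = test_kept[feature_cols].fillna(train_medians)

print(empty_cols)
print(test_kept)
```

Note that imputation cannot rescue a column that is empty in the training set: there are no training values from which to estimate a fill value, and a constant column would carry no signal for the model anyway.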


Mohammad Athar
