How to handle missing columns in test data?

Question

I have training data like the following:

col1    col2    col3   col4       col5  Target
187.67  448.41  45.7   880070.41  1     -3
95.44   446.08  70.51  909069.06  4     120

I need to build a model and test it with the following data:

col1  col2  col3
45    2989  12
3     1111  121

The test data has only three columns. I am planning to build a model using all 5 columns of the training data set as features. Is it better to build the model with only three columns of the training data set and use only the 3 columns of the test data for prediction? Or is it better to build the model with all 5 columns of the training data set, impute col4 and col5 for the test data, and then run the prediction? We feel that col4 and col5 are important. Please suggest a methodology to handle this.

score 1 · Answer 1 · answered Oct 28 '17 at 13:04

If you need to build the model with 5 features, then train a model that predicts col4 using col1, col2, col3 and your target variable. Do the same for col5. Select the model by cross-validation, because you don't know the outcome on the test set. This may help a little in some situations. Hope it helps.
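A minimal sketch of this idea, using only the rows shown in the question. The test rows have no target, so the sketch predicts col4 and col5 from col1 to col3 only. A plain nearest-neighbour average stands in for "a model" here; `knn_impute` and its parameters are hypothetical names, and the features are left unscaled, which you would probably want to change on real data.

```python
import math

def knn_impute(train_rows, test_rows, feature_idx, target_idx, k=1):
    """Fill the target_idx columns of each test row from its k nearest training rows."""
    completed = []
    for row in test_rows:
        # Rank training rows by Euclidean distance on the shared features.
        # Assumption: no scaling, so large-valued columns dominate the distance.
        nearest = sorted(
            train_rows,
            key=lambda tr: math.dist([tr[i] for i in feature_idx], row),
        )[:k]
        # Average each missing column over the selected neighbours.
        filled = [sum(tr[j] for tr in nearest) / len(nearest) for j in target_idx]
        completed.append(list(row) + filled)
    return completed

# Training rows from the question: col1..col5 (target left out on purpose).
train = [
    [187.67, 448.41, 45.7, 880070.41, 1],
    [95.44, 446.08, 70.51, 909069.06, 4],
]
# Test rows from the question: col1..col3 only.
test = [[45, 2989, 12], [3, 1111, 121]]

completed = knn_impute(train, test, feature_idx=[0, 1, 2], target_idx=[3, 4], k=1)
for row in completed:
    print(row)
```

Every imputed value is a function of col1 to col3, so the completed test rows carry no information beyond what the three original columns already hold.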

score 0 · Answer 2 · answered Oct 28 '17 at 10:18

If you don't have the data, you don't have the data. If your col4 and col5 have more than 40-50% missing values then don't bother imputing and using them. Just make the model using the first 3 columns.
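As a rough illustration of that rule, the check below keeps only the columns whose missing rate in the test rows stays under a threshold. The helper names and the 0.4 cut-off are assumptions for the sketch (0.4 is just the lower end of the 40-50% mentioned above).

```python
def missing_fraction(rows, columns):
    """Share of rows in which each column is absent or None."""
    return {
        col: sum(1 for r in rows if r.get(col) is None) / len(rows)
        for col in columns
    }

def usable_features(rows, columns, max_missing=0.4):
    """Columns whose missing rate does not exceed max_missing."""
    fractions = missing_fraction(rows, columns)
    return [col for col in columns if fractions[col] <= max_missing]

# Test rows from the question: col4 and col5 are absent in every row.
test_rows = [
    {"col1": 45, "col2": 2989, "col3": 12},
    {"col1": 3, "col2": 1111, "col3": 121},
]
all_columns = ["col1", "col2", "col3", "col4", "col5"]

print(missing_fraction(test_rows, all_columns))
features = usable_features(test_rows, all_columns)
print(features)
```

Here col4 and col5 are missing in 100% of the test rows, so only the first three columns pass the check.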

If you still feel the need to have it, then use a random forest model or something to predict on those missing values, using the 3 features and maybe the target. Use the values you have as training data and the values you don't have as test data. But you will never know if your model is predicting something sensible or just something without significance.
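The imputed values for the test rows themselves cannot be checked, but the training rows do contain col4 and col5, so you can get a rough idea of imputation quality there. The sketch below holds out each training row in turn and predicts its value from the nearest remaining row. It uses a 1-nearest-neighbour predictor as a simple stand-in for the random forest, and `loo_imputation_error` is a hypothetical helper name. With only the two sample rows from the question the number is not meaningful; it only shows the mechanics.

```python
import math

def loo_imputation_error(rows, feature_idx, target_idx):
    """Leave-one-out mean absolute error of 1-nearest-neighbour imputation."""
    errors = []
    for i, held_out in enumerate(rows):
        # Every row except the held-out one acts as imputation training data.
        others = rows[:i] + rows[i + 1:]
        query = [held_out[j] for j in feature_idx]
        nearest = min(
            others,
            key=lambda r: math.dist([r[j] for j in feature_idx], query),
        )
        # Compare the imputed value with the value that was actually observed.
        errors.append(abs(nearest[target_idx] - held_out[target_idx]))
    return sum(errors) / len(errors)

# Training rows from the question: col1..col5.
train = [
    [187.67, 448.41, 45.7, 880070.41, 1],
    [95.44, 446.08, 70.51, 909069.06, 4],
]

col4_error = loo_imputation_error(train, feature_idx=[0, 1, 2], target_idx=3)
col5_error = loo_imputation_error(train, feature_idx=[0, 1, 2], target_idx=4)
print(col4_error, col5_error)
```

If the held-out error is large relative to the spread of the column, that is a hint the imputed values are mostly noise. This check still assumes the test rows come from the same distribution as the training rows.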

score 0 · Answer 3 · answered Oct 28 '17 at 19:53

Imputing the same constant value everywhere will certainly not help, but rather degrade performance.

As a rule of thumb, your input data should have the same characteristics, including missing data rate.

So most likely, you'll have to ignore the two extra columns in your training data.
