One hot encoding training and test data

Question

I am working on the "House Prices - Advanced Regression Techniques" machine learning problem. They provide training data and test data. I have to create a model that will predict the house prices of the test set.

Many features in my train and test sets are categorical. I used pd.get_dummies on my train set to make them all numerical. I also dropped some features, cleaned the data, and imputed missing values on my training set.

Once I train my model on this cleaned training data, can I use the same model to predict on the test data? Keep in mind, I did not clean the test set at all: no one-hot encoding, no removing columns, and no data cleaning like I did on the training set. So I am assuming the model will not be able to evaluate the test data, right?

So do I have to perform the same operations that I did on my training set on my test set as well?

score 1 · Answer 1 · answered Apr 12 '23 at 05:14

Whatever data processing you perform on the training set, you also need to perform on the test set. Think of the model as a function with a fixed number of inputs x_1, x_2, ..., x_n. If you call get_dummies on the training set but not on the test set, some of those inputs will be missing from the test data, and calling .predict(test) will not work. Hope that makes sense.
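As a rough illustration of the point above, here is a minimal sketch of one common way to keep the encoded columns consistent. The toy DataFrames and their column names (`LotArea`, `Street`, `SalePrice`) are made-up stand-ins, not the actual competition files, and the `reindex` step is just one possible approach, assuming that categories missing from the test set should be encoded as zeros.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train/test files.
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "LotArea": [11622, 14267],
    "Street": ["Pave", "Pave"],  # "Grvl" never appears in the test set
})

# Apply the same encoding step to both sets.
X_train = pd.get_dummies(train.drop(columns="SalePrice"))
X_test = pd.get_dummies(test)

# The test set yields fewer dummy columns because one category is absent.
# Reindex to the training columns: missing columns are added and filled
# with 0, columns unseen in training are dropped, and the order matches.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test.columns))
```

After this step both frames have the same columns in the same order, which is what a fitted model expects when you call .predict on the test features. Any other steps done on the training set (dropping features, imputing) would need to be repeated on the test set in the same way, using values computed from the training data.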
