
This might look like a trivial problem, but I am stuck on making predictions from a model. My problem is as follows:

I have a dataset of shape 1000 x 19 (excluding the target feature), which becomes 1000 x 141 after one-hot encoding. Since I trained the model on data of shape 1000 x 141, I need data of shape 1 x 141 (at least) for prediction. I also know that in Python I can make a prediction using

model.predict(data)

But the data I receive from an end user through a web portal has shape 1 x 19, so I am confused about how to proceed to make predictions based on the user data.

How can I convert data of shape 1 x 19 into 1 x 141 while maintaining the same column order as the train/test data? Any help in this direction would be highly appreciated.

Venkatachalam
  • If you're using the latest sklearn version of OneHotEncoder, you can simply use the built-in [inverse transform](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.inverse_transform) method – G. Anderson May 14 '19 at 15:25
  • you can check this: https://stackoverflow.com/questions/54786266/prediction-after-one-hot-encoding – Anubhav Singh May 14 '19 at 15:30
  • @G.Anderson Can you please explain a bit, using an example, how I can use inverse_transform? I read the documentation but the whole picture is still not clear to me, and it would address my issue if I had data of shape 1x141 – Girijesh Singh May 14 '19 at 15:59
  • Upon a second reading, I am a bit confused as to exactly what your question is. Are you asking how to transform a new user input with the same dimension as your un-transformed data in order to predict, or how to reverse the transformed prediction back to the original dimension for display to the user? – G. Anderson May 14 '19 at 16:03
  • No, my question is this: there are 19 features in the dataset. After one-hot encoding, the number of features becomes 141, and the model is trained on all 141. So prediction requires 141 features as well. But the features now come from an end user, who knows there are only 19 features. My problem is how to make a prediction from those 19 features when the model expects 141. Also, I cannot apply one-hot encoding to the user data, as it has one row and 19 columns. – Girijesh Singh May 14 '19 at 16:05
  • @AnubhavSingh Your suggestion is somewhat relevant, but I only have a pickled model to use for prediction. Any other suggestion, please, as I am working to deploy this in production. – Girijesh Singh May 14 '19 at 16:12
  • @GirijeshSingh Go through the answer carefully; your problem's solution is in the answer itself. BTW, a lot of things matter here: whether you are doing dimensionality reduction, how you are doing one-hot encoding, which algorithm you are using (which also decides which features are significant), etc. – Anubhav Singh May 14 '19 at 17:44
  • Why don't you transform your user data from 19 to 141 features using the same technique (one-hot encoding) that you used for your training data before feeding it to the predict function? – manishm May 14 '19 at 15:49
  • Thank you for your response, but don't you think this belongs in a comment rather than an answer? BTW, maintaining alignment with the previous data is required, which is not possible with your suggested method – Girijesh Singh May 14 '19 at 15:53

1 Answer


I am assuming that you are using sklearn's OneHotEncoder to create the one-hot encoding. If so, the problem is easily solved, because you fit the encoder on your training data:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories = "auto", handle_unknown = "ignore")
X_train_encoded = encoder.fit_transform(X_train)

In the code above, the encoder is fitted on your training data, so when you get the test (or user) data, you can transform it into the same encoded representation using this fitted encoder:

test_data = encoder.transform(test_data)

Now your test data will also have shape 1 x 141. You can check the shape using

(pd.DataFrame(test_data.toarray())).shape
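The fit-once / transform-later flow above can be sketched end to end with a tiny made-up frame (the column names and categories here are invented for illustration; your real data has 19 columns). The key point is that the fitted encoder remembers the category-to-column mapping, so a single user row comes out with exactly the same column count and order as the training matrix:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data: 2 categorical columns, 3 + 2 = 5 categories total.
X_train = pd.DataFrame({
    "city": ["delhi", "mumbai", "delhi", "pune"],
    "plan": ["basic", "pro", "basic", "pro"],
})

# Fit once on the training data; persist this encoder (e.g. with
# joblib/pickle) alongside the model for use in the web portal.
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
X_train_encoded = encoder.fit_transform(X_train)  # shape (4, 5)

# A single user row with the SAME columns in the SAME order:
user_row = pd.DataFrame({"city": ["mumbai"], "plan": ["basic"]})
user_encoded = encoder.transform(user_row)        # shape (1, 5)

print(X_train_encoded.shape, user_encoded.shape)
```

With `handle_unknown="ignore"`, a category the encoder never saw during fitting simply encodes as all zeros instead of raising an error, which is usually what you want for live user input.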
secretive
  • Thank you for your response! It's working fine. Just a small edit: OneHotEncoder instead of OneHotEncode in the line encoder = OneHotEncode(categories = "auto", handle_unknown = "ignore") – Girijesh Singh May 16 '19 at 07:58
  • Hey @rajat kabra, do I need to perform label encoding as well before applying the one-hot encoder if I have many categorical string values in a column? – Girijesh Singh May 16 '19 at 13:14
  • No, you don't have to do label encoding before one-hot. Label encoding can be done instead of one-hot, as it assigns a unique number to every category, which can be used in ML algorithms. – secretive May 16 '19 at 14:41
  • Thanks @rajat kabra. I have one more question: fit_transform is converting 39 input features (1000x39) into 5090 features (1000x5090). Any suggestion on this? Please help – Girijesh Singh May 16 '19 at 15:40
  • That's not good. That's the curse of dimensionality: there are too many features. It is not a good idea to one-hot encode your entire data. Try feature selection to reduce the number of features before one-hot encoding. Some features might be better handled with label encoding rather than one-hot. – secretive May 16 '19 at 15:54
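To illustrate the label-encoding alternative mentioned in the comments, here is a minimal sketch (the user_id column is invented for illustration). For feature columns, sklearn's OrdinalEncoder is the appropriate tool (LabelEncoder is intended for targets); unlike one-hot encoding, it keeps one column per feature regardless of how many categories that feature has:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# One high-cardinality categorical column; one-hot would create one
# column per distinct value, while ordinal encoding keeps a single column.
X_train = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u1"]})

enc = OrdinalEncoder()
X_enc = enc.fit_transform(X_train)  # shape stays (4, 1)

print(X_enc.ravel().tolist())  # → [0.0, 1.0, 2.0, 0.0]
```

The trade-off is that ordinal codes impose an artificial order on the categories, which some algorithms (e.g. linear models) may misinterpret, so it is a better fit for tree-based models or genuinely ordinal features.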