-1

I've been working with the scikit-learn library for machine learning, and I ran into a problem with dummy variables when doing regression. I have two sets of samples: a training set and a test set. The program uses the training set to build the prediction model, then uses the test set to check the score. If the two sets have the same number of columns, everything runs fine. But creating dummy variables changes the shape, and the two sets end up with different numbers of columns.

Example

  • Training set: 130 rows × 3 columns

  • Test set: 60 rows × 3 columns

After converting columns 1 and 2 to dummy variables, the shapes change:

  • Training set: 130 rows × 15 columns

  • Test set: 60 rows × 12 columns

Is there a way to solve this? Is it possible to fit and score the model even when the two data sets have different shapes?

Sample Program: https://www.dropbox.com/s/tcc1ianmljf5i8c/Dummy_Error.py?dl=0

aydinugur
Stev Jane

1 Answer

2

If I understand your code correctly, you are using pd.get_dummies to create the dummy variables and are passing your entire data frame to the function.

In that case, pandas creates a dummy column for every distinct value in every categorical column it finds. Here, it looks like the training set contains more category values than the test set, which is why you end up with more columns in training than in test.
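To make the mismatch concrete, here is a minimal sketch on made-up data. The column names `city` and `y` and all the values are hypothetical and are not taken from your script; the point is only that encoding the two frames separately yields different column sets.

```python
import pandas as pd

# Hypothetical toy data: "city" takes three values in train but only two in test.
train = pd.DataFrame({"city": ["A", "B", "C", "A"], "y": [1.0, 2.0, 3.0, 1.5]})
test = pd.DataFrame({"city": ["A", "B"], "y": [1.2, 2.1]})

# Encoding each frame on its own: each call only sees its own categories.
train_d = pd.get_dummies(train, columns=["city"])
test_d = pd.get_dummies(test, columns=["city"])

print(train_d.shape)  # (4, 4)
print(test_d.shape)   # (2, 3)

# The column that exists in train but is missing from test.
print(sorted(set(train_d.columns) - set(test_d.columns)))  # ['city_C']
```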

A better approach is to combine everything into one dataframe, create the dummy variables on the combined data set, and then split the data back into train and test.
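A rough sketch of that idea, again on hypothetical data (the `city` and `size` columns are placeholders for your two categorical columns). Using `keys` in `pd.concat` is just one way to remember which rows came from which set; a marker column would work as well.

```python
import pandas as pd

# Hypothetical stand-ins for the real training and test frames.
train = pd.DataFrame({"city": ["A", "B", "C", "A"], "size": ["S", "M", "L", "M"]})
test = pd.DataFrame({"city": ["A", "B"], "size": ["S", "S"]})

# Stack both sets; the keys become the outer index level so we can split later.
combined = pd.concat([train, test], keys=["train", "test"])

# Encode once, so both sets see the full list of categories.
encoded = pd.get_dummies(combined, columns=["city", "size"])

# Split back into the original sets; the columns are now identical.
train_enc = encoded.xs("train")
test_enc = encoded.xs("test")

print(train_enc.shape, test_enc.shape)
```

Both frames now have the same columns in the same order, so a model fitted on `train_enc` can score `test_enc` without a shape error.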

amanbirs
  • It's a good approach, and I had considered it. But suppose that after I build the prediction model I get more records, and unfortunately some of their values did not exist in the training sample. Do I need to combine the data and redo the prediction over and over? – Stev Jane Nov 15 '17 at 07:53
  • 1
    @StevJane That's a problem you need to consider, and a pretty obvious one in real-world scenarios. Don't think of it as new categories in a column; just think of new records as new data. You can either discard whatever doesn't match the training data or retrain on the combined data. – Vivek Kumar Nov 15 '17 at 08:52
  • 1
    @VivekKumar is absolutely right. Also, if your new data consistently has categories that were not available in your training data, then your training data is not representative of real-world data, which is a much bigger problem anyway. – amanbirs Nov 15 '17 at 08:54
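For the "discard what's not matching the training data" option mentioned in the comments, one possible sketch is to encode the new records and then reindex them to the training columns. This is an illustration on hypothetical data, not something from the original script: categories unseen in training are dropped, and training categories absent from the new data are filled with 0.

```python
import pandas as pd

# Hypothetical data: category "D" never appeared in training.
train = pd.DataFrame({"city": ["A", "B", "C"]})
new = pd.DataFrame({"city": ["A", "D"]})

# dtype=int keeps the dummy columns as 0/1 integers.
train_enc = pd.get_dummies(train, columns=["city"], dtype=int)

# Encode the new records, then force them onto the training column layout.
new_enc = pd.get_dummies(new, columns=["city"], dtype=int).reindex(
    columns=train_enc.columns, fill_value=0
)

print(new_enc)
```

The row with the unseen category ends up as all zeros, so the model treats it as "none of the known categories". Whether that is acceptable depends on your data; otherwise retraining on the combined data is the alternative.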