
I have two data frames:

df_train

Data types in the dataset:  ['uint8', 'int64', 'float64']
Number of features:  233
Shape:  (1457, 233)

df_test

Data types in the dataset:  ['uint8', 'int64', 'float64']
Number of features:  216
Shape:  (1447, 216)

The difference in the number of columns (233 vs. 216) comes from the dummy variables I created in both frames with pd.get_dummies(): fewer were created in df_test. Before that, df_train had just one extra column, "SalePrice", which is the target variable to be predicted for df_test.

X = df_train.drop(["SalePrice"], axis=1)
y = df_train["SalePrice"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train, y_train)

This results in a solid test set score and all's well. But when I try to predict on df_test as follows

y_pred = ridge.predict(df_test)

it gives the following error:

ValueError: shapes (1447,216) and (232,) not aligned: 216 (dim 1) != 232 (dim 0)

I understand I messed up with the different shapes. Can you help me manage this problem? I have 0 experience in this. Much appreciated.

P. Prunesquallor
  • Did you say that `df_test` does not have the dummies created for `X_train`? If so, that is your problem: you need the same set of features on the test data as on the data used to train your model, otherwise the prediction will fail. So process your `df_test` to produce the same format as the training data (i.e. run your dummy creation on `df_test` as per your comment above). – ags29 Nov 01 '17 at 18:02
  • Yes, actually it's 16 dummies short. What should I do, simply add columns with those names with all values 0? (This just occurred to me; if that's the right answer, apologies for posting.) – P. Prunesquallor Nov 01 '17 at 18:04
  • How can I ensure that all the same features are in `df_test` without doing it manually with a for loop? (if another way is at all possible) – P. Prunesquallor Nov 01 '17 at 18:05
  • Re-reading your question, I guess your issue may be that the test set variables have fewer levels than the training variables. One way to tackle that could be to concat the train and test data (after creating an identifier column) and then run `get_dummies`. You can then split into train and test again using the identifier and you should have the same levels. – ags29 Nov 01 '17 at 18:07
  • No no, actually you were right with the first comment: I just had to add the missing dummies to df_test and set them all to 0. Now it's working. Thank you very much:) – P. Prunesquallor Nov 01 '17 at 18:13
  • In this case, test has fewer dummies than train. What will you do when test has more columns? Will you then remove columns from test? I would advise running `get_dummies()` on all the data before splitting into train and test. This looks similar to this question: https://stackoverflow.com/q/47061707/3374996 – Vivek Kumar Nov 02 '17 at 06:51
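
For reference, the two approaches discussed in the comments could be sketched roughly as below. This is only an illustration on made-up toy data: `train_raw`, `test_raw`, and the `Area`/`Zone` columns are hypothetical stand-ins for the real frames before `get_dummies()`. Option 1 uses `DataFrame.reindex` in place of a manual loop; it adds the missing dummies as zeros and also drops any dummy that appears only in the test data. Option 2 follows the concat idea, but uses the `keys=` argument of `pd.concat` instead of a separate identifier column, which is a substitution rather than exactly what was suggested.

```python
import pandas as pd

# Hypothetical toy data standing in for the real frames before get_dummies().
train_raw = pd.DataFrame(
    {"Area": [100, 150, 120], "Zone": ["A", "B", "C"], "SalePrice": [10, 20, 15]}
)
test_raw = pd.DataFrame({"Area": [110, 130], "Zone": ["A", "D"]})

# Dummies created separately, as in the question: the column sets differ.
df_train = pd.get_dummies(train_raw)
df_test = pd.get_dummies(test_raw)
X = df_train.drop(["SalePrice"], axis=1)

# Option 1: force df_test onto the training columns.
# Missing dummies (Zone_B, Zone_C) are added as 0; extra ones (Zone_D) are dropped.
df_test_aligned = df_test.reindex(columns=X.columns, fill_value=0)

# Option 2: encode train and test together, then split them apart again.
# keys= labels each block in the outer index level, so no identifier column is needed.
combined = pd.concat(
    [train_raw.drop(columns="SalePrice"), test_raw], keys=["train", "test"]
)
combined_dummies = pd.get_dummies(combined)
X_all = combined_dummies.loc["train"]
test_all = combined_dummies.loc["test"]
```

With Option 1 the model is fit on `X` exactly as before and `ridge.predict(df_test_aligned)` receives the columns in training order. With Option 2 the model would be fit on `X_all`, which also contains levels seen only in the test data (all zeros in the training rows).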
