I have two data frames:
df_train
Data types in the dataset: ['uint8', 'int64', 'float64']
Number of features: 233
Shape: (1457, 233)
df_test
Data types in the dataset: ['uint8', 'int64', 'float64']
Number of features: 216
Shape: (1447, 216)
The difference in the number of columns (233 vs 216) is due to the dummy variables I created in both frames with pd.get_dummies(): fewer were created in df_test. Prior to that, df_train originally contained just one extra variable, "SalePrice", which is the target variable to be predicted on df_test.
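To illustrate what I think is happening, here is a toy sketch (the frames and the "Street" column values are made up for illustration, not my real data). Calling pd.get_dummies() on each frame separately only creates dummy columns for the categories that frame actually contains:

```python
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration only.
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                          "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"],
                         "LotArea": [11622, 14267]})

# Each call only sees the categories present in its own frame.
train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

# Columns created for the training frame but never created for the test frame.
missing_in_test = sorted(set(train_dummies.columns) - set(test_dummies.columns))
print(train_dummies.shape, test_dummies.shape)
print(missing_in_test)
```

Because "Grvl" never appears in the toy test frame, no dummy column is created for it there, so the two encoded frames end up with different widths.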
X = df_train.drop(["SalePrice"], axis=1)
y = df_train["SalePrice"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train, y_train)
This results in a solid test set score and all's well. But when I try to predict on df_test as follows:
y_pred = ridge.predict(df_test)
it gives the following error:
ValueError: shapes (1447,216) and (232,) not aligned: 216 (dim 1) != 232 (dim 0)
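I understand the problem comes from the mismatched shapes: the model was fitted on 232 feature columns, but df_test has only 216. How can I make df_test line up with the columns used for training? I have no experience with this. Much appreciated.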
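One approach I have come across, but have not verified on my real data, is to reindex the test frame to the training columns and fill the dummy columns that were never created with 0. A minimal sketch on toy data, assuming every missing column is a dummy variable (the frames and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for the encoded training features and test frame.
toy_X = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                      "Street_Grvl": [0, 1, 0],
                      "Street_Pave": [1, 0, 1]})
toy_test = pd.DataFrame({"LotArea": [11622, 14267],
                         "Street_Pave": [1, 1]})

# Give the test frame exactly the training columns, in the same order.
# Dummy columns absent from the test frame are filled with 0; any column
# that exists only in the test frame would be dropped by this call.
toy_test_aligned = toy_test.reindex(columns=toy_X.columns, fill_value=0)

print(toy_test_aligned.shape)
print(list(toy_test_aligned.columns))
```

On my data this would presumably translate to df_test.reindex(columns=X.columns, fill_value=0) before calling ridge.predict, but I am not sure whether that is the right way to handle it.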