Please help: I am getting a "number of features" error when calling predict. The columns used are an ID column, encoded columns and an integer column. The same code works on another dataset with similar, but more, features. Could the number of features here be too small, and is that what causes this error? This is my code:
import pandas as pd  # needed for pd.DataFrame further down
from sklearn.model_selection import train_test_split
num_test = 0.20 # 80-20 split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
clf = RandomForestClassifier()
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt', 'auto'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
              }
acc_scorer = make_scorer(accuracy_score)
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train, y_train)
ids = data_test['Id']
predictions = clf.predict(data_test.drop('Id', axis=1))
output = pd.DataFrame({'Id': ids, 'Full_Time_Home_Goals': predictions})
print(output.head())
The error I am getting is:
> Traceback (most recent call last):
>   File "C:/Users/harsh/PycharmProjects/Kaggle-Machine Learning from Start to Finish with Scikit-Learn/EPL Predicting.py", line 98, in <module>
>     predictions = clf.predict(data_test.drop('Id', axis=1))
>   File "C:\Users\harsh\PycharmProjects\GitHub\venv\lib\site-packages\sklearn\ensemble\_forest.py", line 629, in predict
> ValueError: Number of features of the model must match the input. Model n_features is 4 and input n_features is 2
Even when I don't drop the Id column in predictions = clf.predict(data_test.drop('Id', axis=1)), I still get the error.
Sample dataset:
data_train:
   Id  HomeTeam  AwayTeam  Full_Time_Home_Goals
0   1        55       440                     3
1   2       158       493                     2
2   3       178       745                     1
3   4       185       410                     1
4   5       249        57                     2
data_test:
       Id  HomeTeam  AwayTeam
0  190748       284        54
1  190749       124       441
2  190750       446        57
3  190751       185       637
4  190752       749       482
As far as I can tell, the columns are set up the way they should be. Why does it not work?
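One way to narrow this down might be to compare the columns the model was fitted on with the columns passed to predict. The sketch below uses a hypothetical helper, `compare_features`, which is not part of scikit-learn or pandas. Since the code that builds `X_all` is not shown above, the training column list is an assumption: it supposes `X_all` kept all four columns of `data_train`, which would be consistent with "Model n_features is 4" in the traceback.

```python
def compare_features(train_columns, predict_columns):
    """Return (missing, extra) column names.

    missing: columns used for fitting but absent at predict time.
    extra:   columns present at predict time but not used for fitting.
    """
    train_columns = list(train_columns)
    predict_columns = list(predict_columns)
    missing = [c for c in train_columns if c not in predict_columns]
    extra = [c for c in predict_columns if c not in train_columns]
    return missing, extra


# ASSUMPTION: X_all kept every column of data_train (4 features).
assumed_train_columns = ["Id", "HomeTeam", "AwayTeam", "Full_Time_Home_Goals"]
# data_test.drop('Id', axis=1) leaves these 2 features.
predict_columns = ["HomeTeam", "AwayTeam"]

missing, extra = compare_features(assumed_train_columns, predict_columns)
print("missing at predict time:", missing)
print("unexpected at predict time:", extra)
```

With the real data, the call would presumably be `compare_features(X_all.columns, data_test.drop('Id', axis=1).columns)`. If the assumption holds, the target `Full_Time_Home_Goals` and `Id` are among the training features, and building `X_all` without them would leave both sides with the same 2 features.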