Please help: I am getting a number-of-features error. The columns used are an ID column, encoded columns, and an integer column. This code works for another dataset with similar but more features. Is the number of features I am using too small, and is that what causes this error? This is my code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

num_test = 0.20  # 80-20 split
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)

clf = RandomForestClassifier()
# 'auto' is omitted: for classifiers it meant the same as 'sqrt' and newer scikit-learn releases reject it
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
              }

acc_scorer = make_scorer(accuracy_score)

# Search the grid, scoring each candidate by accuracy
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Take the best estimator and fit it on the training split
clf = grid_obj.best_estimator_
clf.fit(X_train, y_train)

# This predict call is the line that raises the error
ids = data_test['Id']
predictions = clf.predict(data_test.drop('Id', axis=1))
output = pd.DataFrame({'Id': ids, 'Full_Time_Home_Goals': predictions})

print(output.head())

The error I am getting is:

> Traceback (most recent call last):
>       File "C:/Users/harsh/PycharmProjects/Kaggle-Machine Learning from Start to Finish with Scikit-Learn/EPL Predicting.py", line 98, in
> <module>
>         predictions = clf.predict(data_test.drop('Id', axis=1))
>       File "C:\Users\harsh\PycharmProjects\GitHub\venv\lib\site-packages\sklearn\ensemble\_forest.py",
> line 629, in predict
>     ValueError: Number of features of the model must match the input. Model n_features is 4 and input n_features is 2

Even when I don't drop the Id column in predictions = clf.predict(data_test.drop('Id', axis=1)), I still get the error.

Sample dataset:

data_train:
   Id  HomeTeam  AwayTeam  Full_Time_Home_Goals
0   1        55       440                     3
1   2       158       493                     2
2   3       178       745                     1
3   4       185       410                     1
4   5       249        57                     2

data_test:
       Id  HomeTeam  AwayTeam
0  190748       284        54
1  190749       124       441
2  190750       446        57
3  190751       185       637
4  190752       749       482

The columns look the way they should for this to work. Why does it not work?

PyNoob
  • The line `predictions = clf.predict(data_test.drop('Id', axis=1))` should read `predictions = clf.predict(X_test)` – Sergey Bushmanov Sep 25 '20 at 21:51
  • @SergeyBushmanov should your comment be an answer? – rleir Sep 26 '20 at 00:43
  • @rleir That's not the answer, as I am getting a model mismatch error. We are predicting on test data, right, not train data? Sergey, that code is not working – PyNoob Sep 26 '20 at 01:52
  • `data_train` and `data_test` are not relevant to the code you've posted. You do not define them in the code and there is no need to use undefined variables either. Try what I have suggested and share any errors you get – Sergey Bushmanov Sep 26 '20 at 08:21
  • Where does `predictions = clf.predict(data_test.drop('Id', axis=1))` come from? You should delete it because you do not have `data_test` defined – Sergey Bushmanov Sep 26 '20 at 10:42
  • You train your model on `X_train` and predict with the trained model on `X_test`, which you split off earlier. That is it – Sergey Bushmanov Sep 26 '20 at 10:43
  • @SergeyBushmanov I am getting an error at `output = pd.DataFrame({'Id': ids, 'Full_Time_Home_Goals': predictions})`. The error message is: `raise ValueError(msg) ValueError: array length 37921 does not match index length 380` – PyNoob Sep 26 '20 at 11:32
  • Your ids and predictions come from different places and have different lengths. – Sergey Bushmanov Sep 26 '20 at 11:35
  • @SergeyBushmanov The same example works with another dataset (with more features / columns of values). Why is mine not working? Can you please help me understand and solve this? – PyNoob Sep 26 '20 at 12:40
  • There are some rules: the data on which you make predictions should replicate the dimensions and types of the data on which the model was trained. If you want to put several series in a df, they must be of the same length. If you still have some problems, try to cut down your example to a start-to-finish [reprex], together with input data and a problem description, and try to ask one question at a time. – Sergey Bushmanov Sep 26 '20 at 12:44
  • I showed you how to run your code snippet. Then, as far as I can judge, you try to feed different data, which I do not understand where it comes from. I cannot help you with data I know nothing of. – Sergey Bushmanov Sep 26 '20 at 12:49
  • @SergeyBushmanov, sorry, I could not get to chat because I don't have enough points. I have posted a fully reproducible example [here](https://stackoverflow.com/questions/64078945/how-do-i-resolve-number-of-features-errorfull-question-because-i-dont-have-eno). I would really appreciate the help – PyNoob Sep 26 '20 at 14:42
  • The new question linked in the previous comment https://stackoverflow.com/questions/64078945/how-do-i-resolve-number-of-features-errorfull-question-because-i-dont-have-eno contains the answer to this question. We should close this question. – rleir Sep 28 '20 at 11:05
  • @rleir How do I close this question? – PyNoob Sep 29 '20 at 06:36
  • @PyNoob Sorry, I don't know. – rleir Sep 29 '20 at 21:15
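
The two rules from the comments above (prediction input must have the same feature columns as the training input, and series placed in one DataFrame must have the same length) can be illustrated with a small sketch. It rests on assumptions: the question never shows how `X_all` and `y_all` were built, so here they are taken from `data_train` using a hypothetical `feature_cols` list, and the frames hold only the five sample rows from the question. It skips the grid search and uses a plain `RandomForestClassifier` to keep the example short.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy frames copied from the sample rows shown in the question.
data_train = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "HomeTeam": [55, 158, 178, 185, 249],
    "AwayTeam": [440, 493, 745, 410, 57],
    "Full_Time_Home_Goals": [3, 2, 1, 1, 2],
})
data_test = pd.DataFrame({
    "Id": [190748, 190749, 190750, 190751, 190752],
    "HomeTeam": [284, 124, 446, 185, 749],
    "AwayTeam": [54, 441, 57, 637, 482],
})

# Assumption: these are the only feature columns (feature_cols is a hypothetical name).
# The same list is used for training and for prediction so the widths always agree.
feature_cols = ["HomeTeam", "AwayTeam"]
X_all = data_train[feature_cols]
y_all = data_train["Full_Time_Home_Goals"]

clf = RandomForestClassifier(n_estimators=9, random_state=23)
clf.fit(X_all, y_all)

# Predict on data_test and take the ids from data_test as well,
# so both series in the output frame have the same length.
predictions = clf.predict(data_test[feature_cols])
output = pd.DataFrame({"Id": data_test["Id"], "Full_Time_Home_Goals": predictions})
print(output.head())
```

Selecting columns by one shared list, rather than dropping `Id` in one place and not in another, keeps the training and prediction inputs aligned. Taking `ids` and `predictions` from the same frame avoids the "array length does not match index length" error mentioned in the comments.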

0 Answers