
I have a dataset that I split using the holdout method in sklearn. The following is the procedure:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

I am using a random forest as the classifier. The following is the code for that:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
R_y_pred = clf.predict(X_test)
target_names = ['Alive', 'Dead']
print(classification_report(y_test, R_y_pred, target_names=target_names))

Now I would like to use stratified k-fold cross-validation on the training set. This is the code that I have written for that:

from sklearn.model_selection import cross_validate

cv_results = cross_validate(clf, X_train, y_train, cv=5)
R_y_pred = cv_results.predict(X_test)
target_names = ['Alive', 'Dead']
print(classification_report(y_test, R_y_pred, target_names=target_names))

I got an error saying that cv_results has no attribute predict.

I would like to know how I could print the classification report after using k-fold cross-validation.

Thank you.

Encipher

1 Answer


cv_results is simply a dictionary of scores that show how well the model performs across the cross-validation splits (5 folds, as specified in this case).

It is not a fitted model, so it cannot be used to make predictions.

For instance, consider a separate problem of predicting hotel cancellations with a classification model. Using 5-fold cross-validation with a random forest classifier yields the following test scores:

>>> from sklearn.model_selection import cross_validate
>>> cv_results = cross_validate(clf, x1_train, y1_train, cv=5)
>>> cv_results

{'fit_time': array([1.09486771, 1.13821363, 1.11560798, 1.08220959, 1.06806993]),
 'score_time': array([0.07809329, 0.10946631, 0.09018588, 0.07582998, 0.07735801]),
 'test_score': array([0.84440007, 0.85172242, 0.85322017, 0.84656349, 0.84190381])}
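The spread of those fold scores is the useful output here. And if you do need the fitted model from each fold back, cross_validate accepts a return_estimator=True argument. A minimal sketch, assuming the same clf, x1_train and y1_train as above:

from sklearn.model_selection import cross_validate

# Summarise the spread of the fold scores shown above
print(cv_results['test_score'].mean(), cv_results['test_score'].std())

# Optionally keep the estimator fitted on each fold; this adds an
# 'estimator' key holding the five fitted classifiers
cv_results = cross_validate(clf, x1_train, y1_train, cv=5, return_estimator=True)
fold_model = cv_results['estimator'][0]  # e.g. the model fitted on the first split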

However, attempting to call predict on this dictionary returns the same error message:

>>> from sklearn.model_selection import cross_validate
>>> cv_results = cross_validate(clf, x1_train, y1_train, cv=5)
>>> cv_results
>>> R_y_pred = cv_results.predict(x1_val)
>>> print(classification_report(y_test, R_y_pred))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[33], line 4
      2 cv_results = cross_validate(clf, x1_train, y1_train, cv=5)
      3 cv_results
----> 4 R_y_pred = cv_results.predict(x1_val)
      5 print(classification_report(y_test, R_y_pred))

AttributeError: 'dict' object has no attribute 'predict'
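
To actually print a classification report after cross-validation, one option is sklearn's cross_val_predict, which returns, for each training sample, the prediction made by the model fitted on the other folds. A minimal sketch, reusing the names from the question (clf, X_train, y_train, target_names):

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report

# Stratified 5-fold CV: each training sample is predicted by the
# model fitted on the remaining four folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred_cv = cross_val_predict(clf, X_train, y_train, cv=skf)

print(classification_report(y_train, y_pred_cv, target_names=target_names))

(With a classifier and an integer cv, sklearn already uses stratified folds by default; passing a StratifiedKFold object just makes this explicit and lets you control shuffling.) The held-out test set is still evaluated the usual way: clf.fit(X_train, y_train) followed by clf.predict(X_test).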
Michael Grogan
  • OK, then how could I get the prediction result on the test dataset when using cross-validation? – Encipher Jun 22 '23 at 18:58
  • You are still making predictions with the original model - R_y_pred = clf.predict(X_test). Cross-validation is simply a means of seeing how accurate the model is when the data is split and tested across different iterations, with the aim of making the evaluation less biased than a single train-test split. I recommend the following resource for more information: https://machinelearningmastery.com/k-fold-cross-validation/ – Michael Grogan Jun 22 '23 at 19:09
  • I did the prediction using R_y_pred = clf.predict(X_test) and there is no change in the classification report. Then why should I apply cross-validation? – Encipher Jun 22 '23 at 19:37
  • Well, you apply it to determine how well the model would perform on unseen data. E.g. you might find that the model makes strong predictions on your test set by chance, but then performs poorly in the real world. k-fold cross-validation minimises this risk by varying the training and test sets, to check whether the test scores remain strong in all cases. E.g. if you get a high test score on one iteration but a lower one on another, then the model may not have as strong a predictive power as originally indicated. – Michael Grogan Jun 22 '23 at 19:49