
I am working on a project in which I am dealing with a large dataset.

I need to train an SVM classifier and evaluate it with KFold cross-validation from sklearn.

import pandas as pd
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score


x__df_chunk_synth = pd.read_csv('C:/Users/anujp/Desktop/sort/semester 4/ATML/Sem project/atml_proj/Data/x_train_syn.csv')
y_df_chunk_synth = pd.read_csv('C:/Users/anujp/Desktop/sort/semester 4/ATML/Sem project/atml_proj/Data/y_train_syn.csv')

svm_clf = svm.SVC(kernel='poly', gamma=1, class_weight=None, max_iter=20000, C=100, tol=1e-5)
X = x__df_chunk_synth
Y = y_df_chunk_synth
scores = cross_val_score(svm_clf, X, Y, cv=5, scoring='f1_weighted')
print(scores)

pred = svm_clf.predict(chunk_test_x)
accuracy = accuracy_score(chunk_test_y, pred)

print(accuracy)

I am using the above code. I understand that the classifier is trained inside cross_val_score, so whenever I try to call the classifier afterwards to predict on the test data, I get this error:

sklearn.exceptions.NotFittedError: This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

What is the correct way of doing this?

Please help me with this issue.


1 Answer


Indeed, model_selection.cross_val_score fits the input model itself, so you don't have to fit it beforehand. However, it does not fit the actual object passed as input; it fits a copy of it, hence the error This SVC instance is not fitted yet... when you later try to predict.

Looking into the source code of cross_validate, which is called by cross_val_score: in the scoring step, the estimator goes through clone first:

scores = parallel(
    delayed(_fit_and_score)(
        clone(estimator), X, y, scorers, train, test, verbose, None,
        fit_params, return_train_score=return_train_score,
        return_times=True, return_estimator=return_estimator,
        error_score=error_score)
    for train, test in cv.split(X, y, groups))

This creates a deep copy of the model, which is why the actual input model is never fitted:

def clone(estimator, *, safe=True):
    """Constructs a new estimator with the same parameters.
    Clone does a deep copy of the model in an estimator
    without actually copying attached data. It yields a new estimator
    with the same parameters that has not been fit on any data.
    ...
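
So, to predict on held-out test data, fit the estimator yourself after scoring it with cross-validation. A minimal sketch, reusing the question's variables (chunk_test_x and chunk_test_y are assumed to be the test split):

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

svm_clf = svm.SVC(kernel='poly', gamma=1, class_weight=None, max_iter=20000, C=100, tol=1e-5)

# cross_val_score fits clones of svm_clf; svm_clf itself is left unfitted
scores = cross_val_score(svm_clf, X, Y, cv=5, scoring='f1_weighted')
print(scores)

# Fit the actual estimator on the full training data, then predict
svm_clf.fit(X, Y)
pred = svm_clf.predict(chunk_test_x)
print(accuracy_score(chunk_test_y, pred))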
  • Thanks a lot for the reply. So what I understood is: we use KFold validation for hyperparameter tuning by looking at the scores. Once we have found the best parameters, we create another classifier with those parameters and train it on the training data. Then this trained classifier can be used to predict on the test data. Please correct me if I understood it wrong. – Raj Rajeshwari Prasad Jul 04 '20 at 21:27
  • Yes, generally you should use a [GridSearch](https://scikit-learn.org/stable/modules/grid_search.html) to fine-tune the model and obtain the best parameters. Then fit a new classifier with those parameters and predict on unseen (test) data. @raj – yatu Jul 04 '20 at 21:36
  • Thank you very much for explaining this. How can we get the parameters after running the cross_val_score to create another classifier for predictions on new data if the input model is not being fitted? Do you have any example code to show how this can be done? – ptn77 Dec 26 '22 at 16:47
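
For readers with the same question as the last comment: a minimal sketch of the GridSearchCV flow described above (the parameter grid here is purely illustrative; with the default refit=True, the best configuration is refit on the whole training set and exposed as best_estimator_):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; choose ranges that make sense for your data
param_grid = {'C': [1, 10, 100], 'kernel': ['poly', 'rbf']}

search = GridSearchCV(svm.SVC(max_iter=20000, tol=1e-5), param_grid,
                      cv=5, scoring='f1_weighted')
search.fit(X, Y)

print(search.best_params_)                            # the tuned parameters
pred = search.best_estimator_.predict(chunk_test_x)   # already refit on X, Y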