I'm trying out some classification problems with sklearn in Python for the first time, and was wondering what the best way is to calculate the error of my classifier (e.g. an SVM) solely on the training data.
My sample code for calculating accuracy and RMSE is as follows:
from math import sqrt
from sklearn import svm
from sklearn.metrics import mean_squared_error, accuracy_score

svc = svm.SVC(kernel='rbf', C=C, decision_function_shape='ovr').fit(X_train, y_train.ravel())
prediction = svc.predict(X_test)
svm_in_accuracy.append(svc.score(X_train, y_train))          # in-sample (training) accuracy
svm_out_rmse.append(sqrt(mean_squared_error(y_test, prediction)))  # y_true comes first
svm_out_accuracy.append(accuracy_score(y_test, prediction))  # same as the manual fraction-correct
I know that 'from sklearn.metrics import mean_squared_error' pretty much gets me the MSE for an out-of-sample comparison. What can I do in sklearn to get an error metric for how well (or badly) my model classified the training data? I ask because I know my data is not perfectly linearly separable (which means the classifier will misclassify some items), and I want the best way to quantify how far off it was. Any help would be appreciated!
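For reference, here's a minimal sketch of what I mean by a training-error metric: predict on the same data used to fit, then score those predictions. The toy dataset here is just a stand-in for my real X_train/y_train, and it's deliberately noisy so it isn't perfectly separable:

```python
# Minimal sketch: training (in-sample) error with sklearn.
# The random toy data below is an assumption standing in for real X_train/y_train.
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score, zero_one_loss, confusion_matrix

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)
# Noisy labels, so the classes overlap and some training points get misclassified
y_train = (X_train[:, 0] + X_train[:, 1] + rng.randn(200) > 0).astype(int)

svc = svm.SVC(kernel='rbf', C=1.0).fit(X_train, y_train)
train_pred = svc.predict(X_train)  # predict on the SAME data used to fit

train_acc = accuracy_score(y_train, train_pred)  # fraction correct
train_err = zero_one_loss(y_train, train_pred)   # fraction misclassified = 1 - accuracy

print(train_acc, train_err)
print(confusion_matrix(y_train, train_pred))     # per-class breakdown of mistakes
```

My understanding is that `zero_one_loss` gives the misclassification rate directly, and the confusion matrix shows which classes the mistakes fall into, but I'm not sure if there's a more standard metric for this.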