
My question: How do I obtain the training error in the svm module (SVC class)?

I am trying to plot the error on the training set and the test set against the number of training samples used (or against other parameters such as C / gamma). However, according to the SVM documentation, there is no exposed attribute or method that returns such data. I did find that RandomForestClassifier exposes an oob_score_, though.

log0
  • The value obtained through the code snippet in the answer below: is it ACCURACY or ERROR? Sorry I posted it as an answer; I can't comment on the previous post because I have less than 50 reputation. – Samson Jul 12 '18 at 00:17

2 Answers


Just compute the score on the training data:

>>> model.fit(X_train, y_train).score(X_train, y_train)

You can also use any other performance metric from the sklearn.metrics module. The documentation is here:

http://scikit-learn.org/stable/modules/model_evaluation.html

Also: oob_score_ is an estimate of the test / validation score, not the training score.
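
For example, here is a minimal sketch (not part of the original answer; it assumes X_train, y_train, X_test, y_test already exist) of getting both the training and test error. score() returns mean accuracy, so the error is simply 1 - accuracy:

>>> from sklearn.svm import SVC
>>> from sklearn.metrics import accuracy_score
>>> model = SVC(kernel='rbf', C=1.0).fit(X_train, y_train)
>>> train_error = 1.0 - model.score(X_train, y_train)                 # score() = mean accuracy
>>> test_error = 1.0 - accuracy_score(y_test, model.predict(X_test))  # same metric via sklearn.metrics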

ogrisel
  • Thanks Olivier for pointing that out. This is still puzzling: the training error is 0.0 for a couple of datasets on which I tried the method above. I even tried the digit recognition challenge on Kaggle and still got 0 training error with a random forest of just 1 tree, while the test error is rather high. How come? (From what I saw in Andrew Ng's videos, you still get a decreasing curve, not a perfect 0.0 training error.) – log0 Jul 31 '13 at 09:15
  • This is expected: training error can be zero while test error rarely is. A large gap between the two denotes overfitting (bad use of memory capacity that prevents good generalization). A large training error denotes underfitting (not enough memory capacity in the model). Tree models are instance learners: they can memorize a full dataset with a single unfolded tree if you don't constrain them to a limited depth. – ogrisel Jul 31 '13 at 09:25 (see the sketch after these comments)
  • The lack of underfitting is not an issue, but the presence of overfitting is. Use random forests or other randomized ensembles of trees to combat the overfitting behavior of a single tree. – ogrisel Jul 31 '13 at 09:25
  • Thank you Olivier! You cleared up a lot of question marks in my head. I'll also test it with other algorithms (less complex models) just to verify my understanding. – log0 Jul 31 '13 at 12:58
  • In case people refer to this in the future: I tried naive_bayes.GaussianNB, naive_bayes.BernoulliNB, NearestCentroid and a few other non-instance-based learners; the training error is non-zero, which confirms Olivier's explanation above. Thanks again. – log0 Jul 31 '13 at 13:24
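
A small illustrative sketch of the point made in the comments above (not from the original thread; it uses the sklearn digits dataset as an example): a single unconstrained decision tree can memorize the training set, giving a training error of roughly 0 while the held-out error stays clearly higher:

>>> from sklearn.datasets import load_digits
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> X, y = load_digits(return_X_y=True)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
>>> tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # no depth limit
>>> tree.score(X_tr, y_tr)   # typically 1.0: the tree memorized the training set
>>> tree.score(X_te, y_te)   # noticeably lower on held-out data (the overfitting gap)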

You can even plot the learning curve using learning_curve. Here is an example:

>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC
>>> train_sizes, train_scores, valid_scores = learning_curve(
...     SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
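
Not part of the original answer, but a rough sketch (assuming matplotlib is available) of turning those scores into the error-versus-training-set-size plot the question asks about:

>>> import matplotlib.pyplot as plt
>>> train_err = 1.0 - train_scores.mean(axis=1)   # average over the CV folds
>>> valid_err = 1.0 - valid_scores.mean(axis=1)
>>> plt.plot(train_sizes, train_err, label='training error')
>>> plt.plot(train_sizes, valid_err, label='validation error')
>>> plt.xlabel('number of training samples')
>>> plt.ylabel('error')
>>> plt.legend()
>>> plt.show()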

Refer to this page for more details: https://scikit-learn.org/stable/modules/learning_curve.html