
The performance of a machine learning classifier can be measured by a variety of metrics, such as precision, recall, and classification accuracy, among others.

Given code like this:

from sklearn import svm

clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
  1. What metric is the fit function trying to optimize?

  2. How can the model be tuned to improve precision, when precision is much more important than recall?

nbro
steve

4 Answers


You can tune the parameters of your SVM with grid-search cross-validation to maximize precision. To do so, set the scoring parameter, like

sklearn.model_selection.GridSearchCV(clf, param_grid, scoring="precision")

Here clf is your SVC classifier and, of course, you also need to define the grid of parameters param_grid. See the examples in the scikit-learn documentation.
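
A fuller sketch using the current sklearn.model_selection API (the param_grid values are illustrative placeholders, not recommendations, and X_train/y_train are the arrays from the question):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters to search over; scoring="precision" makes the
# search select the combination with the best cross-validated precision.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}

clf = svm.SVC(kernel='rbf')
search = GridSearchCV(clf, param_grid, scoring='precision', cv=5)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)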

lanenok
  • I'm not sure how good an idea this is, as you can get 100% precision by setting the threshold appropriately... Probably that won't happen, but it's still not that principled. – Andreas Mueller May 04 '15 at 01:06
  • @Andreas Mueller Sure, there are several _strategies_ for improving model performance. This is the actual work you do when exploring your dataset. Without any info about the dataset, I think this question is about the scikit-learn API. – lanenok May 04 '15 at 05:03
  1. As far as I know, SVMs minimize the hinge loss.

  2. I'm not aware of any general-purpose way to make a support vector classifier prioritize precision over recall. As always, you can use cross-validation and play with the hyperparameters to see if anything helps. Alternatively, you could train a regressor that outputs a value in [0, 1] instead of a classifier. Then, by choosing a threshold and assigning every example that scores above it to class '1', you get a classifier with a tunable threshold, which you can set arbitrarily high to favor precision over recall (sketched below).
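
A minimal sketch of this thresholding idea, assuming scikit-learn, a binary problem, and held-out arrays X_train, y_train, X_test; here SVC(probability=True) is just one convenient way to obtain a score in [0, 1], and the 0.9 cut-off is an arbitrary illustration:

from sklearn import svm

model = svm.SVC(kernel='rbf', probability=True)
model.fit(X_train, y_train)

# Probability of the positive class for each test example.
scores = model.predict_proba(X_test)[:, 1]

# Raise the threshold to favour precision at the cost of recall.
threshold = 0.9
y_pred = (scores >= threshold).astype(int)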

cfh

I see two ways: optimizing by grid-searching over parameters, as @lanenok suggests, or optimizing by adjusting a threshold, as @cfh suggests.

Ideally, you should do both.

I would not try to optimize precision alone, as you can usually reach 100% precision by setting a very high threshold and accepting very low recall. So if possible, define a trade-off between precision and recall that you like, and grid-search over that (one way to do this is sketched below).
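
One way to encode such a trade-off (an assumption on my part, not a prescription) is to grid-search on an F-beta score with beta < 1, which weights precision more heavily than recall:

from sklearn import svm
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# beta < 1 emphasises precision; beta = 0.5 is an arbitrary example value.
precision_heavy = make_scorer(fbeta_score, beta=0.5)

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid,
                      scoring=precision_heavy, cv=5)
search.fit(X_train, y_train)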

You can probably get better results if you actually pick a separate threshold. You can use SVC.decision_function to get a continuous output, and then pick the optimum threshold for the trade-off you want to achieve. To pick the threshold you need a validation set, though, which makes doing this inside the grid search a bit trickier (not impossible, though).

What I usually find to be a good compromise between optimizing what you want and keeping the pipeline simple is to optimize the grid search for something that takes precision into account, say "roc_auc", and then, after the grid search, pick a threshold on a validation set based on the trade-off you like.

roc_auc basically optimizes over all possible thresholds simultaneously, so the selected parameters will not be as specific to the threshold you want as they could be.
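
A sketch of that two-stage workflow (the split size, parameter grid, and 0.95 precision target are arbitrary assumptions): grid-search on roc_auc, then choose a decision threshold on a held-out validation set.

import numpy as np
from sklearn import svm
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a validation set for threshold selection.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25)

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, scoring='roc_auc', cv=5)
search.fit(X_tr, y_tr)

# Continuous scores on the validation set, then the smallest threshold whose
# precision reaches the target (this sketch assumes such a threshold exists).
scores = search.best_estimator_.decision_function(X_val)
precision, recall, thresholds = precision_recall_curve(y_val, scores)
threshold = thresholds[np.argmax(precision[:-1] >= 0.95)]
y_val_pred = (scores >= threshold).astype(int)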

Andreas Mueller

There is a technique where you can write your own loss function to focus on ranking metrics (such as AUC or precision-recall) rather than classification losses (such as hinge loss or log loss).

Refer to Section 4 (Maximizing Recall at Fixed Precision) of the paper Scalable Learning of Non-Decomposable Objectives (https://arxiv.org/pdf/1608.04802.pdf) for more details.