
The performance of a machine learning classifier can be measured by a variety of metrics, such as precision, recall, and classification accuracy, among others.

Given code like this:

from sklearn import svm

clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
  1. What metric is the fit function trying to optimize?

  2. How can the model be tuned to improve precision, when precision is much more important than recall?

nbro
steve

4 Answers


You can tune the parameters of your SVM with grid-search cross-validation to maximize precision. To do so, set the scoring parameter, like

sklearn.model_selection.GridSearchCV(clf, param_grid, scoring="precision")

Here clf is your SVC classifier and, of course, you also need to define the grid of parameters param_grid. See the examples in the scikit-learn documentation.
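
A fuller sketch using the current sklearn.model_selection API (the param_grid values are illustrative placeholders, not recommendations, and X_train/y_train are the arrays from the question):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters to search over; scoring="precision" makes the
# search select the combination with the best cross-validated precision.
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}

clf = svm.SVC(kernel='rbf')
search = GridSearchCV(clf, param_grid, scoring='precision', cv=5)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)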

lanenok
  • I'm not sure how good an idea this is, as you can get 100% precision by setting the threshold appropriately... Probably that won't happen, but it's still not that principled. – Andreas Mueller May 04 '15 at 01:06
  • @Andreas Mueller Sure, there are several _strategies_ for improving model performance. This is the actual work you do when exploring your dataset. Without any info about the dataset, I think this question is about the scikit-learn API. – lanenok May 04 '15 at 05:03
  1. As far as I know, SVMs minimize the hinge loss.

  2. I'm not aware of any general-purpose way to make a support vector classifier prioritize precision over recall. As always, you can use cross-validation and play with the hyperparameters to see if anything helps. Alternatively, you could train a regressor that outputs a value in [0, 1] instead of a classifier. Then, by choosing a threshold and assigning every example that scores above it to class '1', you get a classifier with a tunable threshold, which you can set arbitrarily high to favor precision over recall (sketched below).
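
A minimal sketch of this thresholding idea, assuming scikit-learn, a binary problem, and held-out arrays X_train, y_train, X_test; here SVC(probability=True) is just one convenient way to obtain a score in [0, 1], and the 0.9 cut-off is an arbitrary illustration:

from sklearn import svm

model = svm.SVC(kernel='rbf', probability=True)
model.fit(X_train, y_train)

# Probability of the positive class for each test example.
scores = model.predict_proba(X_test)[:, 1]

# Raise the threshold to favour precision at the cost of recall.
threshold = 0.9
y_pred = (scores >= threshold).astype(int)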

cfh

I see two ways: optimizing by grid-searching over parameters, as @lanenok suggests, or optimizing by adjusting a threshold, as @cfh suggests.

Ideally, you should do both.

I would not try to optimize precision alone, as you can usually reach 100% precision by setting a very high threshold and accepting very low recall. So if possible, define a trade-off between precision and recall that you like, and grid-search over that (one way to do this is sketched below).
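
One way to encode such a trade-off (an assumption on my part, not a prescription) is to grid-search on an F-beta score with beta < 1, which weights precision more heavily than recall:

from sklearn import svm
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# beta < 1 emphasises precision; beta = 0.5 is an arbitrary example value.
precision_heavy = make_scorer(fbeta_score, beta=0.5)

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid,
                      scoring=precision_heavy, cv=5)
search.fit(X_train, y_train)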

You can probably get better results if you actually pick a separate threshold. You can use SVC.decision_function to get a continuous output, and then pick the optimum threshold for the trade-off you want to achieve. To pick the threshold you need a validation set, though, which makes doing this inside the grid search a bit trickier (not impossible, though).

What I usually find to be a good compromise between optimizing what you want and keeping the pipeline simple is to optimize the grid search for something that takes precision into account, say "roc_auc", and then, after the grid search, pick a threshold on a validation set based on the trade-off you like.

roc_auc basically optimizes over all possible thresholds simultaneously, so the selected parameters will not be as specific to the threshold you want as they could be.
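
A sketch of that two-stage workflow (the split size, parameter grid, and 0.95 precision target are arbitrary assumptions): grid-search on roc_auc, then choose a decision threshold on a held-out validation set.

import numpy as np
from sklearn import svm
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a validation set for threshold selection.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25)

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, scoring='roc_auc', cv=5)
search.fit(X_tr, y_tr)

# Continuous scores on the validation set, then the smallest threshold whose
# precision reaches the target (this sketch assumes such a threshold exists).
scores = search.best_estimator_.decision_function(X_val)
precision, recall, thresholds = precision_recall_curve(y_val, scores)
threshold = thresholds[np.argmax(precision[:-1] >= 0.95)]
y_val_pred = (scores >= threshold).astype(int)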

Andreas Mueller

There is a technique where you can write your own loss function to focus on ranking metrics (such as AUC or precision-recall) rather than classification losses (such as hinge loss or log loss).

Refer to Section 4 (Maximizing Recall at Fixed Precision) of the paper Scalable Learning of Non-Decomposable Objectives (https://arxiv.org/pdf/1608.04802.pdf) for more details.