14

I am using a Random Forest classifier in scikit-learn with an imbalanced dataset of two classes. I am much more worried about false negatives than false positives. Is it possible to fix the false negative rate (to, say, 1%) and ask scikit-learn to optimize the false positive rate somehow?

If this classifier doesn't support it, is there another classifier that does?

Simd
  • You may be able to use the `predict_proba` method of the classifier to set your own discrimination threshold. – BrenBarn Oct 13 '15 at 04:56
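
To make that concrete, here is a minimal sketch on synthetic data (the 0.2 cut-off is an arbitrary placeholder you would tune on held-out data, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real imbalanced dataset.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# predict() implicitly cuts predict_proba at 0.5; lowering the cut-off
# flags more positives, trading false positives for fewer false negatives.
threshold = 0.2  # placeholder value, to be tuned on validation data
y_pred = (clf.predict_proba(X_test)[:, 1] >= threshold).astype(int)
```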

3 Answers

6

I believe the problem of class imbalance in sklearn can be partially resolved by using the class_weight parameter.

This parameter is either a dictionary, in which each class is assigned a weight, or a string that tells sklearn how to build that dictionary. For instance, setting this parameter to 'auto' will weight each class in inverse proportion to its frequency.

By weighting the less frequent class more heavily, you can end up with 'better' results.

Classifiers such as SVM or logistic regression also offer this class_weight parameter.
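
As a minimal sketch of class_weight (the binary labels 0/1 and the 10:1 weighting are assumptions, not values from the question):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Explicit dictionary: errors on the rare class 1 cost ten times more.
rf = RandomForestClassifier(n_estimators=100, class_weight={0: 1, 1: 10})

# Or let sklearn weight classes inversely to their frequencies; recent
# versions spell this 'balanced' rather than the older 'auto'.
logreg = LogisticRegression(class_weight='balanced')
```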

This Stack Overflow answer gives some other ideas on how to handle class imbalance, such as undersampling and oversampling.

DJanssens
  • RandomForestClassifier also has class_weight in master (and will have it in the release version in a week or so). – Andreas Mueller Sep 19 '15 at 17:31
  • @AndreasMueller Thank you. If I really only care about a fixed false negative rate, does it make sense to specify the false positive weight as the loss function and try to optimize it using one of the classifiers that supports user-defined loss functions? – Simd Oct 05 '15 at 17:54
  • @AndreasMueller One other thing: the 0.16.1 documentation claims that RandomForestClassifier has class_weight. Is this not functional currently? – Simd Oct 05 '15 at 18:01
  • 1
    It should be working. There are not models with user defined loss functions. You can select hyper-parameter based on user defined scorers, though. – Andreas Mueller Oct 06 '15 at 22:59
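
A hedged sketch of the scorer-based route Andreas describes (the F2 score and the class_weight grid are arbitrary illustrative choices, not prescribed by the thread):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# F2 weighs recall (i.e. avoiding false negatives) more than precision;
# any callable metric could be wrapped the same way.
f2_scorer = make_scorer(fbeta_score, beta=2)

# Pick hyper-parameters by the custom scorer; the grid is illustrative.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"class_weight": [None, "balanced", {0: 1, 1: 10}]},
    scoring=f2_scorer,
)
search.fit(X, y)
print(search.best_params_)
```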
4

I found this article on the class imbalance problem.

http://www.chioka.in/class-imbalance-problem/

In summary, it discusses the following possible solutions:

  • Cost function based approaches
  • Sampling based approaches
  • SMOTE (Synthetic Minority Over-Sampling Technique) – see the sketch after this list
  • Recent approaches: RUSBoost, SMOTEBagging and Underbagging
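
As a sketch of the sampling-based and SMOTE routes, using the third-party imbalanced-learn package (an assumption; the article does not prescribe a library):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE  # third-party imbalanced-learn package
from sklearn.datasets import make_classification

# Synthetic imbalanced data stands in for a real dataset.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before resampling:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after resampling:", Counter(y_res))
```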

Hope it helps.

Pappu Jha
1

Random forest is already a bagged classifier, so that alone should give reasonably good results.

One typical way of reaching a desired false positive or false negative rate is to analyze the classifier using ROC curves (http://scikit-learn.org/stable/auto_examples/plot_roc.html) and modify certain parameters, such as the decision threshold, to achieve, for example, the desired FP rate.

I am not sure whether it is possible to tune the random forest classifier's FP rate through its parameters alone. You can look at other classifiers based on your application.
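
For example, a sketch of reading an operating threshold off the ROC curve that keeps the false negative rate (1 - TPR) at or below the question's 1% target, on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
fpr, tpr, thresholds = roc_curve(y_val, clf.predict_proba(X_val)[:, 1])

# FNR = 1 - TPR; among operating points meeting the FNR target,
# take the one with the lowest false positive rate.
ok = np.flatnonzero(1 - tpr <= 0.01)
best = ok[np.argmin(fpr[ok])]
print("threshold:", thresholds[best], "FPR:", fpr[best], "FNR:", 1 - tpr[best])
```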

Kunal Grover