How to handle class imbalance in sklearn random forests. Should I use sample weights or class weight parameter

Question

I am trying to solve a binary classification problem with a class imbalance. I have a dataset of 210,000 records in which 92% are 0s and 8% are 1s. I am using sklearn (v0.16) in Python for random forests.

I see there are two parameters, sample_weight and class_weight, available when building the classifier. I am currently using class_weight="auto".

Am I using this correctly? What do class_weight and sample_weight actually do, and which should I be using?

Yes, I tried class_weight="auto" and it is giving me good precision and recall, but I wanted to know what is going on under the hood and whether I am even doing it correctly — NG_21, Jan 07 '16 at 13:33

David Maust · Accepted Answer · 2016-01-07T17:56:34.103

Class weights are what you should be using.

Sample weights let you specify a multiplier for the impact of a particular sample. Giving a sample a weight of 2.0 has roughly the same effect as if the point were present twice in the data (although the exact effect is estimator-dependent).
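As a rough illustration of the "present twice" intuition, the sketch below fits one decision tree with a weight of 2.0 on every minority-class sample and another on data where those samples are physically duplicated. The synthetic dataset, the choice of a single shallow DecisionTreeClassifier (rather than a forest, whose bootstrap sampling would change when rows are added), and all variable names are assumptions made for the example. The two trees should behave almost the same, though exact equality is not guaranteed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the imbalanced data in the question (about 92% / 8%).
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.92, 0.08], random_state=0)
minority = y == 1

# Option A: give every minority-class sample a weight of 2.0.
w = np.where(minority, 2.0, 1.0)
tree_weighted = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_weighted.fit(X, y, sample_weight=w)

# Option B: append a second copy of every minority-class sample instead.
X_dup = np.vstack([X, X[minority]])
y_dup = np.concatenate([y, y[minority]])
tree_duplicated = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_duplicated.fit(X_dup, y_dup)

# Compare the two trees on the original points.
agreement = np.mean(tree_weighted.predict(X) == tree_duplicated.predict(X))
print(f"fraction of identical predictions: {agreement:.3f}")
```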

Class weights have the same effect, but they apply a fixed multiplier to every sample that belongs to the specified class. Functionally you could use either, but class_weight is provided for convenience so you do not have to weight each sample manually. It is also possible to combine the two, in which case the class weights are multiplied by the sample weights.
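A minimal sketch of that relationship, assuming a synthetic dataset and an arbitrary weight of 5.0 for the minority class (both made up for illustration): passing class_weight to the constructor should give the same forest as passing the equivalent per-sample array to fit(), and using both together should match passing their product. Note that class_weight="auto" from version 0.16 was later replaced by class_weight="balanced"; an explicit dict is used here because it works across versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the imbalanced data in the question (about 92% / 8%).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.92, 0.08], random_state=0)

# Illustrative class weights: each minority sample counts 5x.
weights = {0: 1.0, 1: 5.0}
per_sample = np.where(y == 1, 5.0, 1.0)  # the same weights, one per record

# 1) class_weight on the constructor.
rf_cw = RandomForestClassifier(n_estimators=20, class_weight=weights,
                               random_state=0).fit(X, y)

# 2) The equivalent sample_weight array on fit().
rf_sw = RandomForestClassifier(n_estimators=20,
                               random_state=0).fit(X, y, sample_weight=per_sample)

same_single = np.allclose(rf_cw.predict_proba(X), rf_sw.predict_proba(X))

# 3) Combining both: class weights are multiplied by the sample weights.
extra = np.random.RandomState(0).uniform(0.5, 1.5, size=len(y))  # hypothetical per-record weights
rf_both = RandomForestClassifier(n_estimators=20, class_weight=weights,
                                 random_state=0).fit(X, y, sample_weight=extra)
rf_prod = RandomForestClassifier(n_estimators=20,
                                 random_state=0).fit(X, y, sample_weight=extra * per_sample)

same_combined = np.allclose(rf_both.predict_proba(X), rf_prod.predict_proba(X))
print(same_single, same_combined)
```

For a plain class imbalance like the one in the question, the constructor argument is the less error-prone of the two, since the weights cannot drift out of sync with the labels.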

One of the main uses for sample_weight on the fit() method is to allow boosting meta-algorithms like AdaBoostClassifier to operate on existing decision tree classifiers and increase or decrease the weights of individual samples as the algorithm requires.
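To make the boosting use case concrete, here is a hand-rolled sketch of a single reweighting round in the style of discrete AdaBoost for two classes. It is not scikit-learn's internal implementation; the dataset, the depth-1 tree, and the variable names are assumptions for illustration. The point is only that the meta-algorithm never modifies the tree code: it just passes a different sample_weight array to fit() each round.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.92, 0.08], random_state=0)

# Round 1: start from uniform weights that sum to 1.
w = np.full(len(y), 1.0 / len(y))
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=w)

# Weighted error of the first learner (clipped to keep the log finite).
miss = stump.predict(X) != y
err = np.clip(np.sum(w[miss]), 1e-10, 1 - 1e-10)

# Learner weight, then upweight the misclassified samples and renormalise.
alpha = np.log((1.0 - err) / err)
w = w * np.exp(alpha * miss)
w /= w.sum()

# Round 2: the same kind of tree, now focused on the samples it got wrong.
next_stump = DecisionTreeClassifier(max_depth=1, random_state=0)
next_stump.fit(X, y, sample_weight=w)
```

In practice AdaBoostClassifier runs this kind of loop for you; the sketch only shows where sample_weight enters.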
