I am new to the random forest classifier. I am using it to classify a dataset that has two classes.

- The number of features is 512.
- The class proportions are 3:1, i.e., 75% of the data is from the first class and 25% from the second.
- I am using 500 trees.
The classifier produces an out-of-bag (OOB) error of 21.52%. The per-class error for the first class (which makes up 75% of the training data) is 0.0059, while the error for the second class is very high: 0.965.
I am looking for an explanation of this behaviour, and for suggestions on how to improve the accuracy on the second class.
I am looking forward to your help. Thanks.
I forgot to say that I'm using R, and that I used a nodesize of 1000 in the test above.
Here I repeated the training with only 10 trees and nodesize = 1 (just to give an idea); below are the function call in R and the resulting confusion matrix:
    randomForest(formula = Label ~ ., data = chData30PixG12, ntree = 10,
                 importance = TRUE, nodesize = 1, keep.forest = FALSE,
                 do.trace = 50)

                   Type of random forest: classification
                         Number of trees: 10
    No. of variables tried at each split: 22

            OOB estimate of  error rate: 24.46%
    Confusion matrix:
               Irrelevant Relevant class.error
    Irrelevant      37954     4510   0.1062076
    Relevant         8775     3068   0.7409440
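For reference, the `randomForest` package exposes parameters that target exactly this kind of imbalance (`strata`, `sampsize`, `classwt`, `cutoff`). Below is a minimal sketch, not a tested fix, of drawing a balanced, stratified bootstrap sample for each tree; it assumes the same `chData30PixG12` data frame and `Label` factor as in the call above:

    library(randomForest)

    # Size of the minority class ("Relevant" here): each tree's bootstrap
    # sample will contain this many cases from *both* classes.
    n_min <- min(table(chData30PixG12$Label))

    rf_balanced <- randomForest(
      Label ~ ., data = chData30PixG12,
      ntree      = 500,
      strata     = chData30PixG12$Label,
      sampsize   = c(n_min, n_min),  # one entry per class level
      importance = TRUE
    )

    print(rf_balanced$confusion)

An alternative is to leave the sampling alone and move the voting threshold instead, e.g. `cutoff = c(0.75, 0.25)`, so that a smaller share of tree votes suffices to predict the minority class.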