
I have a highly imbalanced data set with target class instances in the ratio 60000:1000:1000:50 (i.e. a total of 4 classes). I want to use randomForest to make predictions of the target class.

So, to reduce the class imbalance, I played with the sampsize parameter, setting it to c(5000, 1000, 1000, 50) and some other values, but it did not help much. In fact, the accuracy of the 1st class decreased while I played with sampsize, and the improvement in the other class predictions was marginal.
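For reference, this is the kind of call I was making (a sketch on simulated data; the data frame and column names here are placeholders for my real data):

```r
library(randomForest)
set.seed(42)

# simulate a 4-class imbalance of the same shape (majority count scaled down)
counts <- c(6000, 1000, 1000, 50)
target <- factor(rep(paste0("C", 1:4), times = counts))
train  <- data.frame(x1 = rnorm(sum(counts), mean = as.integer(target)),
                     x2 = rnorm(sum(counts)),
                     target = target)

# per-class sample sizes: downsample the majority class for each tree
rf <- randomForest(target ~ ., data = train,
                   sampsize = c(500, 100, 100, 50))
rf$confusion
```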

While digging through the archives, I came across two more features of randomForest(), strata and classwt, which are meant to offset the class imbalance issue.

All the documents on classwt were old (generally from 2007–2008), and they all suggested not to use the classwt feature of the randomForest package in R, as it does not fully implement the functionality of the original Fortran code. So the first question is:
Is classwt now fully implemented in the randomForest package of R? If yes, what does passing c(1, 10, 10, 10) to the classwt argument represent? (Assuming the above case of 4 classes in the target variable.)

Another approach said to offset the class imbalance issue is stratified sampling, which is always used in conjunction with sampsize. I understand what sampsize is from the documentation, but there is not enough documentation or examples to give a clear insight into using strata for overcoming the class imbalance issue. So the second question is:
What type of arguments have to be passed to strata in randomForest, and what do they represent?

I guess the word weight, which I have not explicitly mentioned in the question, should play a major role in the answer.

desertnaut
StrikeR
  • I would run a forest on only the three smaller classes. That will give you a sense of how well a rf model could possibly distinguish those three classes without the dominant class at all. If the accuracy is still fairly low, then class imbalance is probably not your real problem, rather those three classes are just not easily distinguished with the features you have. – joran Nov 27 '13 at 20:23
  • Thanks @joran . Sorry for a little confusion, here is the actual class instance ratio which I also have changed in the question: 60000:1000:1000:50. Do you think omitting the first class in this case is going to help? Because when I run RF with all 4 classes I get the accuracy in the following order for each class: c(90%, 70%, 70%, less than 10%). I'm more concerned about improving the accuracy of 4th class which is less than 10%. And one more thing is, is the `classwt` correctly implemented in `randomForest` of R as of now? – StrikeR Nov 28 '13 at 04:30
  • As for `classwt` implementation – I suppose it isn't implemented, because at http://cran.r-project.org/web/packages/randomForest/NEWS you can read that "* Implement the new scheme of handling classwt in classification." is in the wishlist. – BartekCh Nov 28 '13 at 20:05
  • Examples of how to use `strata` and `sampsize` can be now found on two SO posts, [here](http://stackoverflow.com/questions/14842059/stratified-sampling-with-random-forests-in-r) and [here](http://stackoverflow.com/questions/20150525/stratified-sampling-doesnt-seem-to-change-randomforest-results) – Tchotchke Aug 19 '15 at 21:33
  • Yes, I don't think `classwt` is implemented. I tried it with different values, and my results are identical to running with the default setting, where `classwt = NULL`. – Zhubarb Sep 23 '15 at 07:03
  • `classwt` IS implemented; it is changing my predictions. – Joshua Stafford Jan 19 '16 at 23:10
  • `classwt` is correctly passed to the underlying code of `randomForest`, [check it](https://github.com/cran/randomForest/blob/master/R/randomForest.default.R#L212). – catastrophic-failure Jul 15 '16 at 14:13
  • Try larger values. I found that I only got results when I increased the weights by a few orders of magnitude. For example, c(1, 10, 50) was unable to change a thing, but c(1, 10, 50000) started to make a difference. Why is this? – Joshua Jul 19 '16 at 10:02

3 Answers


classwt is correctly passed on to randomForest; check this example:

library(randomForest)
# extreme weights: virginica is favored so heavily that every case is predicted as virginica
rf = randomForest(Species ~ ., data = iris, classwt = c(1E-5, 1E-5, 1E5))
rf

#Call:
# randomForest(formula = Species ~ ., data = iris, classwt = c(1e-05, 1e-05, 1e+05)) 
#               Type of random forest: classification
#                     Number of trees: 500
#No. of variables tried at each split: 2
#
#        OOB estimate of  error rate: 66.67%
#Confusion matrix:
#           setosa versicolor virginica class.error
#setosa          0          0        50           1
#versicolor      0          0        50           1
#virginica       0          0        50           0

Class weights are the priors on the outcomes. You need to balance them to achieve the results you want.


On strata and sampsize this answer might be of help: https://stackoverflow.com/a/20151341/2874779

In general, sampsize with the same size for all classes seems reasonable. strata is a factor that is used for stratified resampling; in your case you don't need to pass anything, since by default the class labels themselves are used as strata.
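A minimal sketch of what that looks like, on an artificially imbalanced iris (the subset and counts here are only illustrative):

```r
library(randomForest)
set.seed(1)

# make iris imbalanced: 50 setosa, 10 versicolor, 5 virginica
imb <- iris[c(1:50, 51:60, 101:105), ]

# stratify on the response and draw the same number of cases from every class
rf <- randomForest(Species ~ ., data = imb,
                   strata = imb$Species,
                   sampsize = rep(5, nlevels(imb$Species)))
rf$confusion
```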

catastrophic-failure

You can pass a named vector to classwt, but how the weight is calculated is very tricky.

For example, if your target variable y has two classes "Y" and "N" and you want to set balanced weights, you should do:

wn = sum(y == "N") / length(y)
wy = 1

Then set classwt = c("N" = wn, "Y" = wy).
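Putting that together into a runnable sketch (y is simulated here; in practice it would be your own response factor):

```r
library(randomForest)
set.seed(7)

# simulated imbalanced two-class outcome: 900 "N" vs 100 "Y"
y <- factor(c(rep("N", 900), rep("Y", 100)))
x <- data.frame(x1 = rnorm(1000, mean = ifelse(y == "Y", 1, 0)))

# weight the majority class by its frequency, as described above
wn <- sum(y == "N") / length(y)   # 0.9
wy <- 1
rf <- randomForest(x, y, classwt = c("N" = wn, "Y" = wy))
rf
```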

Alternatively, you may want to use the ranger package. This package offers flexible builds of random forests, and specifying class / sample weights is easy. ranger is also supported by the caret package.

Code Learner

Random forests are probably not the right classifier for your problem as they are extremely sensitive to class imbalance.

When I have an unbalanced problem, I usually deal with it using sampsize like you tried. However, I make all the strata equal size and I use sampling without replacement. Sampling without replacement is important here, as otherwise samples from the smaller classes will contain many more repetitions, and the class will still be underrepresented. It may be necessary to increase mtry if this approach leads to small samples, sometimes even setting it to the total number of features.
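A sketch of that setup, again on an artificially imbalanced iris (the subset, counts, and mtry value are only illustrative):

```r
library(randomForest)
set.seed(123)

# artificially imbalanced iris: 50 / 10 / 5 cases per class
imb <- iris[c(1:50, 51:60, 101:105), ]

# equal-size strata drawn without replacement; mtry raised to all features
n_min <- min(table(imb$Species))              # 5, the smallest class
rf <- randomForest(Species ~ ., data = imb,
                   replace = FALSE,
                   strata = imb$Species,
                   sampsize = rep(n_min, nlevels(imb$Species)),
                   mtry = 4)                   # all four predictors
rf$confusion
```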

This works quite well when there are enough items in the smallest class. However, your smallest class has only 50 items; I doubt you would get useful results with sampsize=c(50,50,50,50).

Also classwt has never worked for me.

Daniel Mahler