Stratified sampling doesn't seem to change randomForest results

Question

I am using the randomForest package in R to build several species distribution models. My response variable is binary (0 = absence, 1 = presence) and pretty unbalanced: for some species the ratio of absences to presences is 37:1. This imbalance (or zero-inflation) leads to questionable out-of-bag (OOB) error estimates: the larger the ratio of absences to presences, the lower my OOB error estimate.

To compensate for this imbalance, I wanted to implement stratified sampling such that each tree in the random forest included an equal (or at least less imbalanced) number of results from both the presence and absence categories. I was surprised that there doesn't seem to be any difference between the stratified and unstratified model OOB error estimates. See my code below:

Without stratification

> set.seed(25)
> HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
> HHrf
Call:
  randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702

With stratification

> HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, strata = bll_HH$HH_Pres, sampsize = ceiling(.632*nrow(bll_HH)))
> HHrf

Call:
 randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702

Is there a reason that I am getting the same results in both cases? For the strata argument, I specify my response variable, HH_Pres. For the sampsize argument, I specify that it should just be 63.2% of the entire dataset.

Anyone know what I am doing wrong? Or is this to be expected?

Thanks,

Liza

To reproduce this problem:

Sample data: https://docs.google.com/file/d/0B-JMocik79JzY3B4U3NoU3kyNW8/edit

Code:

library(randomForest)
bll = read.csv("bll_Nov2013_NMV.csv", header=TRUE)
HH_Pres <- bll$HammerHeadALL_Presence

Slope <-bll$Slope
Dist2Shr <-bll$Dist2Shr
Bathy <-bll$Bathy2
Chla <-bll$GSM_Chl_Daily_MF
SST <-bll$SST_PF_daily
Region <- bll$Region
MoonPhase <-bll$MoonPhase
DaylightHours <- bll$DaylightHours
bll_HH <- data.frame(HH_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region)
set.seed(25)

HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
HHrf
set.seed(25)
HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, strata = bll_HH$HH_Pres, sampsize = c(100, 50), ntree = 500, replace = FALSE, importance = TRUE)
HHrf

I know this is an old question, but... you're naming the second model as `HHrf_strata`, but checking the output for `HHrf` both times... shouldn't the last line be `HHrf_strata` instead? — Aramis7d, Jul 12 '18 at 11:42
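
The mix-up described in the comment above can be checked directly. A minimal sketch, assuming the randomForest package is installed and using the built-in iris data in place of bll_HH (the object names rf_plain and rf_strat are illustrative, not from the question): print the stratified object itself, and look at its stored call to confirm that strata and sampsize were actually passed.

```r
library(randomForest)

# Unstratified fit (iris stands in for the asker's data)
set.seed(25)
rf_plain <- randomForest(Species ~ ., data = iris, ntree = 100)

# Stratified fit: one sampsize entry per level of the strata factor
set.seed(25)
rf_strat <- randomForest(Species ~ ., data = iris, ntree = 100,
                         strata = iris$Species, sampsize = c(20, 30, 20))

# Print the stratified fit, not the first model
rf_strat

# The stored call lists the arguments that were actually supplied
names(as.list(rf_strat$call))
```

If strata and sampsize are missing from the printed Call:, the object being printed is most likely not the stratified model.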

ialm · Accepted Answer · 2013-11-22T22:50:07.353

As far as I know, the sampsize argument should be a vector that is the same length as the number of classes in your data set. If you specify a factor variable in the strata argument, then sampsize should be given a vector that is the same length as the number of levels of that factor. I am not sure it performs as you describe in your question, but it has been a while since I have used the randomForest function.

From the help files, it says:

strata

A (factor) variable that is used for stratified sampling.

sampsize:

Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

For example, since your classification has 2 distinct classes, you need to give sampsize a vector of length 2 that specifies how many observations you want to sample from each class during training time.

e.g. sampsize=c(100,50)

Furthermore, you can specify the names of the groups to be extra clear.

e.g. sampsize=c('0'=100, '1'=50)

An example from the help files that uses the sampsize argument, to clarify:

## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
data(iris)
(iris.rf2 <- randomForest(iris[1:4], iris$Species, sampsize=c(20, 30, 20)))

EDIT: Added some notes about the strata argument in randomForest.

EDIT: Make sure the strata argument is given a factor variable!

e.g. try strata = factor(HH_Pres), sampsize = c(...) where c(...) is a vector that is the same length as length(levels(factor(bll_HH$HH_Pres)))
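
A minimal sketch of building a sampsize vector whose length and names follow the factor levels. The response here is synthetic, with class counts (440 and 94) taken from the confusion matrix in the question; drawing about 63.2% of the smallest class from every class is an illustrative choice, not something the help file prescribes.

```r
# Synthetic stand-in for factor(bll_HH$HH_Pres): 440 absences, 94 presences
y <- factor(rep(c(0, 1), times = c(440, 94)))

# Size of the smallest class
n_min <- min(table(y))

# Draw the same number from each class, leaving some minority rows
# out-of-bag when replace = FALSE
per_class <- floor(0.632 * n_min)

# One named entry per factor level
samp <- setNames(rep(per_class, nlevels(y)), levels(y))
samp

# Guard against a length mismatch before passing it to randomForest
stopifnot(length(samp) == nlevels(y))
```

The result would then be passed as strata = y, sampsize = samp. Taking fewer than all rows of the smallest class matters with replace = FALSE, since a class that is fully sampled for every tree never has out-of-bag rows.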

EDIT:

OK, I tried running the code with your data, and it works for me.

# Fix up the data set to have HH_Pres and Region as factors
bll_HH$Region <- factor(bll_HH$Region)
bll_HH$HH_Pres <- factor(bll_HH$HH_Pres)

# Original RF code
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                      Slope + MoonPhase + Chla + Region,
                    data=bll_HH, ntree = 500, replace = FALSE, 
                    importance = TRUE, na.action = na.omit)
HHrf

# Output
#         OOB estimate of  error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 425 15  0.03409091
# 1  86  8  0.91489362

# Take 63.2% from each class
mySampSize <- ceiling(table(bll_HH$HH_Pres) * 0.632)

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE, 
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
#         OOB estimate of  error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 424 16  0.03636364
# 1  85  9  0.90425532

Note that the OOB error estimate is the same in this case, even though we draw only 63.2% of the data from each class for each tree. This is probably because these sample sizes are proportional to the class distribution in your training data, so each tree sees much the same class mix as before (with replace = FALSE, the default sampsize is already ceiling(.632*nrow(x))); the relatively small size of your data set also plays a part. Let's try changing mySampSize to make sure it REALLY worked.

# Change mySampSize. Sample 100 from class 0 and 50 from class 1
mySampSize[1] <- 100
mySampSize[2] <- 50

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE, 
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
#         OOB estimate of  error rate: 21.16%
# Confusion matrix:
#     0  1 class.error
# 0 382 58   0.1318182
# 1  55 39   0.5851064
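
To see why the proportional 63.2% sample left the OOB estimate essentially unchanged, the arithmetic can be sketched in base R. The class counts below are read off the confusion matrices above (440 absences, 94 presences after na.omit); the variable names are illustrative.

```r
# Class counts taken from the confusion matrices above
n_class <- c("0" = 440, "1" = 94)
n_total <- sum(n_class)

# Default subsample size in randomForest when replace = FALSE
default_size <- ceiling(0.632 * n_total)

# Proportional stratified sizes, computed the same way as mySampSize
strat_size <- ceiling(n_class * 0.632)

default_size
strat_size

# Share of presences in the full data vs. in each stratified subsample
round(n_class["1"] / n_total, 3)
round(strat_size["1"] / sum(strat_size), 3)
```

Both the total size per tree and the share of presences come out nearly identical, so the proportional stratified forest is trained on much the same kind of subsample as the default one. Only a sampsize that changes the class mix, such as c(100, 50), shifts the confusion matrix noticeably.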

Thanks so much for your answer! I understand the sampsize argument a lot better now. Unfortunately I am still getting the same OOB error estimate from the stratified and non-stratified models. I was wondering, is it weird that when I print the stratified model after running it, the "sampsize" and "strata" arguments no longer appear after the Call:? See my original example. — lah, Nov 22 '13 at 18:09
@Liza I cannot try your example without some sample data. But, try changing the `strata` and `sampsize` arguments - they may not be valid, and `randomForest` may be ignoring them. — ialm, Nov 22 '13 at 18:15
Here is some sample data. If you have a chance to test it out, I would really appreciate it! [Sample data](https://docs.google.com/file/d/0B-JMocik79JzY3B4U3NoU3kyNW8/edit). I'll edit my original example to include all of the variable names. — lah, Nov 22 '13 at 19:33

Jeffrey Evans · Answer 2 · 2013-11-22T21:00:51.143

This syntax seems to be working fine for me on your data. The OOB error is 32.21% and the class errors are 0.32 and 0.29. I did increase the number of trees (ntree) to 1000. I always recommend using indexing (the x and y arguments) rather than a formula to define a RF model. In certain circumstances, the formula syntax seems to be unstable.

require(randomForest)
HHrf <- read.csv("bll_HH.csv")
set.seed(25)
( rf.mdl <- randomForest( y=as.factor(HHrf[,"HH_Pres"]), x=HHrf[,2:ncol(HHrf)],
                          strata=as.factor(HHrf[,"HH_Pres"]), sampsize=c(50,50),
                          ntree=1000) )

score 0 · Answer 3 · answered Feb 01 '21 at 02:16

I ran into this problem too. What I noticed is that my error rate when using importance = TRUE changes significantly. It is not the same as if I did not choose stratification with sampling.

For me it ended up being a tradeoff in not having an importance/accuracy score for my classification tree. It appears to be one of many bugs in this implementation.
