Context

Using the ranger package in R, I am training a probability forest on a binary response.

Problem Summary

For particular combinations of data size and the sample.fraction argument, every out-of-bag predicted probability is exactly 0.5.

Small tweaks to either of these move the predictions away from exactly 0.5 to values that hover near 0.5.

Question

Why does this particular combination of data size and sample.fraction produce predicted probabilities of exactly 0.5?

Code

The raw class frequencies are 5/65 and 60/65. For demonstration purposes, the only covariate is random noise.

response <- c(rep(0, 5), rep(1, 60))
df <- data.frame(resp=as.factor(response), noise=rnorm(65))
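
As a quick sanity check (a small addition of mine, not part of the original setup), the class counts can be confirmed directly:

table(df$resp)   # expect 5 observations of class 0 and 60 of class 1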

Using the sample.fraction argument to approximately balance the classes when sampling observations for each tree, we train the model and inspect the out-of-bag predictions. All of them are exactly 0.5, with no variation.

r <- ranger::ranger(formula=resp~noise, 
                    probability=TRUE, replace=TRUE,
                    data=df,
                    # Lowering sample.fraction below 5/65 also results in predictions of 0.5. 
                    # Raising it to 6/65 changes the results.
                    sample.fraction = c(5/65, 5/65), 
                    num.trees=150,
                    importance='impurity',
                    keep.inbag = TRUE)
r$predictions
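
To see what each tree is actually fitted on, the in-bag counts kept via keep.inbag = TRUE can be tallied by class. The short sketch below is my addition rather than part of the original post; with sample.fraction = c(5/65, 5/65) on 65 observations, each of the 150 trees should draw 5 in-bag observations from each class, which may help pin down why the averaged out-of-bag probability lands on exactly 0.5.

# Sketch: per-tree in-bag draws by class (uses the fit above, which set keep.inbag = TRUE)
inbag <- simplify2array(r$inbag.counts)          # 65 x 150 matrix of per-tree draw counts
draws_class0 <- colSums(inbag[df$resp == "0", ]) # in-bag draws from class 0, per tree
draws_class1 <- colSums(inbag[df$resp == "1", ]) # in-bag draws from class 1, per tree
table(draws_class0, draws_class1)                # expectation: every tree at (5, 5)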

Making small tweaks to the problem setup moves the predicted values away from being precisely ambiguous (even if they remain practically ambiguous). For example, suppose I instead use data of twice the size, with the same raw class proportions and the same sample.fraction argument.

# rf on something exactly as rare as 5/65, but more observations
df2 <- data.frame(resp=as.factor(rep(response, 2)),
                  noise=rnorm(130))
rf_df2 <- ranger::ranger(formula=resp~noise, 
                         probability=TRUE, replace=TRUE,
                         data=df2,
                         sample.fraction = c(5/65, 5/65), 
                         num.trees=150,
                         importance='impurity',
                         keep.inbag = TRUE)
rf_df2$predictions
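
For a side-by-side look (again my own sketch, not from the original post), the spread of the out-of-bag probability for class "1" can be compared across the two fits; the first collapses to a single value while the second varies around 0.5:

# Sketch: spread of the OOB probability for class "1" in both fits
range(r$predictions[, "1"], na.rm = TRUE)    # expected to be exactly 0.5 for every row
summary(rf_df2$predictions[, "1"])           # expected to hover near, but not at, 0.5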