
1) I tried a regression random forest on a training set of 185 rows with 4 independent variables. Two variables are categorical, with 3 and 13 levels respectively; the other two are continuous numeric variables.

I ran RF with 10-fold cross validation repeated 4 times. (I didn't scale the dependent variable, which is why the RMSE is so big.)

I guess the reason mtry is bigger than 4 is that the categorical variables have 3 + 13 = 16 levels in total. But if so, why doesn't it include the number of numeric variables?

185 samples
4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 168, 165, 166, 167, 166, 167, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE    
   2    16764183  0.7843863  9267902
   9     9451598  0.8615202  3977457
  16     9639984  0.8586409  3813891

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.

Please help me understand mtry.

2) Also, the fold sample sizes are 168, 165, 166, .... Why does the sample size change between folds?

sample sizes: 168, 165, 166, 167, 166, 167

Thank you so much.

StupidWolf
rocknRrr

1 Answer


You are correct that there are 16 variables to sample from, hence the maximum for mtry is 16. (Note that 3 + 13 = 16 is a coincidence of the arithmetic: with the formula interface, caret dummy-codes the factors, so the actual count is (3 − 1) + (13 − 1) = 14 dummy columns plus your 2 numeric columns = 16.)
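A quick way to see this expansion is to build the model matrix yourself. The sketch below uses hypothetical data mirroring the question's shape (one 3-level factor, one 13-level factor, two numeric columns); the column names are made up:

```r
# Dummy-coding sketch: two factors (3 and 13 levels) plus two numeric
# columns expand to (3-1) + (13-1) + 2 = 16 model-matrix columns.
set.seed(1)
df <- data.frame(y   = rnorm(50),
                 f3  = factor(sample(LETTERS[1:3],  50, replace = TRUE),
                              levels = LETTERS[1:3]),
                 f13 = factor(sample(letters[1:13], 50, replace = TRUE),
                              levels = letters[1:13]),
                 n1  = runif(50),
                 n2  = rpois(50, 10))
X <- model.matrix(y ~ ., df)[, -1]  # drop the intercept column
ncol(X)
# [1] 16
```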

The values chosen by caret are controlled by two things. In train there is a tuneLength option, which defaults to 3:

tuneLength = ifelse(trControl$method == "none", 1, 3)

This means it tests three values. For the "rf" model (randomForest), the tuning parameter is mtry, and the default grid is generated by:

caret::getModelInfo("rf")[[1]]$grid
#> function (x, y, len = NULL, search = "grid") 
#> {
#>     if (search == "grid") {
#>         out <- data.frame(mtry = caret::var_seq(p = ncol(x), 
#>             classification = is.factor(y), len = len))
#>     }
#>     else {
#>         out <- data.frame(mtry = unique(sample(1:ncol(x), size = len, 
#>             replace = TRUE)))
#>     }
#>     out
#> }

Created on 2022-07-01 by the reprex package (v2.0.1)

Since you have 16 columns, it becomes:

caret::var_seq(16, len = 3)
[1]  2  9 16
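For regression with len > 1, var_seq essentially takes an evenly spaced sequence from 2 to p, which you can reproduce in base R (a sketch of var_seq's grid logic, not the function itself):

```r
# Reproduce var_seq's regression grid without caret:
# an evenly spaced sequence from 2 up to p, floored and deduplicated.
p   <- 16
len <- 3
unique(floor(seq(2, p, length.out = len)))
# [1]  2  9 16
```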

You can test mtry values of your choice by passing your own tuneGrid:

library(caret)
trCtrl <- trainControl(method = "repeatedcv", repeats = 4, number = 10)
# we test mtry = 2, 4, 6, ..., 16
trg <- data.frame(mtry = seq(2, 16, by = 2))
# some random data for the example
df <- data.frame(y  = rnorm(200),
                 x1 = sample(letters[1:13], 200, replace = TRUE),
                 x2 = sample(LETTERS[1:3], 200, replace = TRUE),
                 x3 = rpois(200, 10),
                 x4 = runif(200))

# fit (train() defaults to method = "rf", i.e. randomForest)
mdl <- train(y ~ ., data = df, tuneGrid = trg, trControl = trCtrl)

Random Forest 

200 samples
  4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE      
   2    1.120216  0.04448700  0.8978851
   4    1.157185  0.04424401  0.9275939
   6    1.172316  0.04902991  0.9371778
   8    1.186861  0.05276752  0.9485516
  10    1.193595  0.05490291  0.9543479
  12    1.200837  0.05608624  0.9574420
  14    1.205663  0.05374614  0.9621094
  16    1.210783  0.05537412  0.9665665

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
Rui Barradas
StupidWolf