
The Short

I'm trying to use tuneRF to find the optimal mtry value for my randomForest call, but I'm finding that the answer is extremely unstable and changes from run to run / with different seeds. I would run a loop to see how it varies over a large number of runs, but I'm unable to extract which mtry has the lowest OOB error from each run.

The Long

I have a data.frame with eight features, but two of the features are inclusive, meaning all the information in one is a subset of the other. As an example, one feature could be a factor A ~ c("animal", "fish") and another feature a factor B ~ c("dog", "cat", "salmon", "trout"). Hence all dogs and cats are animals and all salmon and trout are fish. These two variables are by far more significant than any of the other six. Hence, if I run three forests, one that uses A, one that uses B and one that uses A & B, the last one seems to do the best. I suspect this is because A and/or B are so significant that by including both I double the chance of one of them being selected randomly as the initial feature. I further suspect that I shouldn't allow this to happen and that I should throw out A as a factor, but I cannot find any literature that actually says that.

Anyway, getting back on track. I have two datasets, tRFx and tRFx2: the first contains 7 features including B but not A, and the second contains 8 features with both A and B. I'm trying to see what the optimal mtry is for these two separate models, and then how they perform relative to each other. The problem is that tuneRF seems, at least in this case, to be very unstable.

For the first dataset (includes feature B but not A):

> set.seed(1)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)  
mtry = 2  OOB error = 17.73% 
Searching left ...
Searching right ...
mtry = 3    OOB error = 17.28% 
0.02531646 0.01 
mtry = 4    OOB error = 18.41% 
-0.06493506 0.01 
      mtry  OOBError
2.OOB    2 0.1773288
3.OOB    3 0.1728395
4.OOB    4 0.1840629
> set.seed(3)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 2  OOB error = 18.07% 
Searching left ...
Searching right ...
mtry = 3    OOB error = 18.18% 
-0.00621118 0.01 
      mtry  OOBError
2.OOB    2 0.1806958
3.OOB    3 0.1818182

i.e. for seed 1, mtry = 3, but for seed 3, mtry = 2.

And for the second dataset (includes both features A & B):

> set.seed(1)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3  OOB error = 17.51% 
Searching left ...
mtry = 2    OOB error = 16.61% 
0.05128205 0.01 
Searching right ...
mtry = 4    OOB error = 16.72% 
-0.006756757 0.01 
      mtry  OOBError
2.OOB    2 0.1661055
3.OOB    3 0.1750842
4.OOB    4 0.1672278
> set.seed(3)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3  OOB error = 17.4% 
Searching left ...
mtry = 2    OOB error = 18.74% 
-0.07741935 0.01 
Searching right ...
mtry = 4    OOB error = 17.51% 
-0.006451613 0.01 
      mtry  OOBError
2.OOB    2 0.1874299
3.OOB    3 0.1739618
4.OOB    4 0.1750842

i.e. for seed 1, mtry = 2, but for seed 3, mtry = 3.

I was going to run a loop to see which mtry is optimal over a large number of simulations but don't know how to capture the optimal mtry from each iteration.

I know that I can use

> set.seed(3)
> min(tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01))
mtry = 3  OOB error = 17.4% 
Searching left ...
mtry = 2    OOB error = 18.74% 
-0.07741935 0.01 
Searching right ...
mtry = 4    OOB error = 17.51% 
-0.006451613 0.01 
[1] 0.1739618

but I don't want to capture the OOB error (0.1739618), I want the optimal mtry (3).
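For what it's worth, tuneRF returns the same matrix it prints at the end, with columns "mtry" and "OOBError", so the winning mtry can be read off each run with which.min. A sketch of such a loop, reusing the tRFx2/tRFy objects from the question (n.rep = 50 is an arbitrary choice; note that tuneRF's formal argument appears to be ntreeTry, not nTreeTry, so the capitalised spelling may be silently ignored and the default used instead):

```r
library(randomForest)

# repeat the tuning many times and record which mtry wins each run
n.rep = 50
best.mtry = numeric(n.rep)
for (i in seq_len(n.rep)) {
  set.seed(i)
  res = tuneRF(x = tRFx2, y = tRFy, ntreeTry = 250,
               stepFactor = 1.5, improve = 0.01,
               trace = FALSE, plot = FALSE)
  # res is a matrix with columns "mtry" and "OOBError";
  # take the mtry of the row with the smallest OOB error
  best.mtry[i] = res[which.min(res[, "OOBError"]), "mtry"]
}
table(best.mtry)  # how often each mtry came out on top
```

The same one-liner, `res[which.min(res[, "OOBError"]), "mtry"]`, works on a single tuneRF result if you just want the best mtry from one run.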

Any help (or even general comments on anything related to tuneRF) is greatly appreciated. For anybody else who stumbles upon this looking for tuneRF help, I also found this post helpful: R: unclear behaviour of tuneRF function (randomForest package).

For what it's worth, it seems that the optimal mtry for the smaller feature set (with non-inclusive features) is 3, while for the larger feature set it is only 2, which is initially counter-intuitive, but when you consider the inclusive nature of A and B it may make sense.

    I don't have an answer for your specific question, but I can say that the info in `A` is not truly a subset of `B` like you stated. For example, `fish` may be equivalent to `salmon|trout`, but it is not the same as either of them alone. No less than a model effect is the same as either of the individual terms in a regression model. So you have legitimate reason to include both features. – Special Sauce Dec 02 '15 at 01:23
  • Thanks @Special Sauce, greatly appreciated the input. That does make sense to me. – SC. Dec 02 '15 at 01:49

1 Answer

  • There's not a big difference in performance in this case (and most others) between the mtry values you could choose. It only matters if you want to win Kaggle contests where winner takes all, and then you would probably also be blending many other learning algorithms into one huge ensemble. In practice you get almost the same predictions.

  • You don't need stepwise optimization when you are testing so few parameter combinations. Just try them all and repeat many times to figure out which mtry is slightly better.

  • Every time I have used tuneRF, I have been disappointed. I always ended up writing my own stepwise optimization or simply trying all combinations many times.

  • The mtry vs. OOB-error curve does not have to be smooth with a single minimum, though a general trend should be visible. It can be difficult to tell whether a given minimum is due to noise or a genuine tendency.

I wrote an example of how to do a solid mtry screening. The conclusion from this screening would be that there's not much difference: mtry = 2 seems best, and it is slightly faster to compute. Since this is a classification problem, the default, mtry = floor(sqrt(ncol(X))), would have been 2 here anyway.

library(mlbench)
library(randomForest)

data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes[, !names(PimaIndiansDiabetes) %in% "diabetes"]
nvar = ncol(X)
nrep = 25  # forests trained per candidate mtry

# for each mtry from 1 to nvar, train nrep forests and keep the
# final OOB error of each (err.rate[, 1] is the running OOB error)
rf.list = lapply(1:nvar, function(i.mtry) {
  replicate(nrep, tail(randomForest(X, y, mtry = i.mtry,
                                    ntree = 2000)$err.rate[, 1], 1))
})

# one point per forest, plus mean (green) and mean +/- sd (red) per mtry
plot(replicate(nrep, 1:nvar), do.call(rbind, rf.list), col = "#12345678",
     xlab = "mtry", ylab = "oob.err", main = "tuning mtry by oob.err")
rep.mean = sapply(rf.list, mean)
rep.sd   = sapply(rf.list, sd)
points(1:nvar, rep.mean, type = "l", col = 3)
points(1:nvar, rep.mean + rep.sd, type = "l", col = 2)
points(1:nvar, rep.mean - rep.sd, type = "l", col = 2)

(plot: OOB error vs. mtry; one point per forest, with the per-mtry mean and mean ± sd lines overlaid)

Soren Havelund Welling