The Short
I'm trying to use tuneRF to find the optimal mtry value for my randomForest model, but I'm finding that the answer is extremely unstable and changes from run to run with different seeds. I would run a loop to see how it varies over a large number of runs, but I'm unable to extract which mtry has the lowest OOB error from each run.
The Long
I have a data.frame with eight features, two of which are inclusive, meaning all the information in one is a subset of the other. As an example, one feature could be a factor A ~ c("animal", "fish") and another feature a factor B ~ c("dog", "cat", "salmon", "trout"). Hence all dogs and cats are animals, and all salmon and trout are fish. These two variables are by far more significant than any of the other six. Hence if I run three forests, one that uses A, one that uses B, and one that uses A & B, the last one seems to do the best. I suspect this is because A and/or B are so significant that by including both I double the chance of one of them being randomly selected as the initial split feature. I further suspect that I shouldn't allow this to happen and that I should throw out A as a feature, but I cannot find any literature that actually says that.
Anyway, getting back on track: I have two datasets, tRFx and tRFx2. The first contains seven features, including B but not A; the second contains eight features, with both A and B. I'm trying to find the optimal mtry for these two separate models, and then see how they perform relative to each other. The problem is that tuneRF seems, at least in this case, to be very unstable.
For the first dataset (includes feature B but not A):
> set.seed(1)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 2 OOB error = 17.73%
Searching left ...
Searching right ...
mtry = 3 OOB error = 17.28%
0.02531646 0.01
mtry = 4 OOB error = 18.41%
-0.06493506 0.01
mtry OOBError
2.OOB 2 0.1773288
3.OOB 3 0.1728395
4.OOB 4 0.1840629
> set.seed(3)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 2 OOB error = 18.07%
Searching left ...
Searching right ...
mtry = 3 OOB error = 18.18%
-0.00621118 0.01
mtry OOBError
2.OOB 2 0.1806958
3.OOB 3 0.1818182
i.e. for seed 1 the optimal mtry = 3, but for seed 3, mtry = 2.
And for the second dataset (includes both features A & B):
> set.seed(1)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3 OOB error = 17.51%
Searching left ...
mtry = 2 OOB error = 16.61%
0.05128205 0.01
Searching right ...
mtry = 4 OOB error = 16.72%
-0.006756757 0.01
mtry OOBError
2.OOB 2 0.1661055
3.OOB 3 0.1750842
4.OOB 4 0.1672278
> set.seed(3)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3 OOB error = 17.4%
Searching left ...
mtry = 2 OOB error = 18.74%
-0.07741935 0.01
Searching right ...
mtry = 4 OOB error = 17.51%
-0.006451613 0.01
mtry OOBError
2.OOB 2 0.1874299
3.OOB 3 0.1739618
4.OOB 4 0.1750842
i.e. for seed 1 the optimal mtry = 2, but for seed 3, mtry = 3.
I was going to run a loop to see which mtry is optimal over a large number of simulations, but I don't know how to capture the optimal mtry from each iteration.
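One way to do this: tuneRF() returns a two-column matrix with columns "mtry" and "OOBError", so the winning mtry can be read off each run with which.min() and tallied across seeds. A minimal sketch, assuming the questioner's tRFx (features) and tRFy (response) exist; note the documented argument is spelled ntreeTry rather than nTreeTry, and trace/plot are switched off to silence the per-run output:

```r
library(randomForest)

# Assumes tRFx (feature data.frame) and tRFy (response) from the question.
# tuneRF() returns a matrix with columns "mtry" and "OOBError";
# which.min() on the error column gives the winning mtry for that run.
best_mtry <- sapply(1:50, function(seed) {
  set.seed(seed)
  res <- tuneRF(x = tRFx, y = tRFy, ntreeTry = 250,
                stepFactor = 1.5, improve = 0.01,
                trace = FALSE, plot = FALSE)
  res[which.min(res[, "OOBError"]), "mtry"]
})

table(best_mtry)  # how often each mtry wins across the 50 seeded runs
```

The table then shows whether one mtry dominates or the instability is genuine.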
I know that I can use:
> set.seed(3)
> min(tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01))
mtry = 3 OOB error = 17.4%
Searching left ...
mtry = 2 OOB error = 18.74%
-0.07741935 0.01
Searching right ...
mtry = 4 OOB error = 17.51%
-0.006451613 0.01
[1] 0.1739618
but that captures the minimum OOB error (0.1739618), whereas I want the optimal mtry (3).
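Since tuneRF() returns its results as a matrix (columns "mtry" and "OOBError"), the mtry can be obtained by indexing the row with the minimum error instead of calling min() on the whole matrix. A sketch of the same call:

```r
library(randomForest)

# Assumes tRFx2 / tRFy from the question exist. Index the returned matrix
# by the row with the lowest OOB error rather than taking min() of everything.
set.seed(3)
res <- tuneRF(x = tRFx2, y = tRFy, ntreeTry = 250,
              stepFactor = 1.5, improve = 0.01)
res[which.min(res[, "OOBError"]), "mtry"]  # the optimal mtry for this run
```

This returns the mtry value itself, which is what the loop above needs to capture.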
Any help (or even general comments on anything related to tuneRF) would be greatly appreciated. For anybody else who stumbles upon this looking for tuneRF help, I also found this post useful:
R: unclear behaviour of tuneRF function (randomForest package)
For what it's worth, it seems that the optimal mtry for the smaller feature set (without the inclusive feature A) is 3, while for the larger feature set it is only 2. This is counter-intuitive at first, but when you consider the inclusive nature of A and B it does/may make sense.