
I have a question about the “randomForest” package in R. I am trying to build a model with the ecological variables that best explain my species occupancy data for 41 field sites (gathered with camera traps). My ultimate goal is species occupancy modeling with the “unmarked” package, but before I get to that stage I need to select the variables that best explain occupancy, since I have many. To get a feel for the randomForest package, I generated a fake occupancy dataset and a fake variable dataset (with variables A and D being good predictors of occupancy and B and C being bad predictors). When I run randomForest, my output looks like this:

           0        1 MeanDecreaseAccuracy MeanDecreaseGini
A 25.3537667 27.75533           26.9634018       20.6505920
B  0.9567857  0.00000            0.9665287        0.0728273
C  0.4261638  0.00000            0.4242409        0.1411643
D 32.1889374 35.52439           34.0485837       27.0691574

        OOB estimate of  error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000

I did not make a separate train and test set. I put extra weight on the model correctly predicting the “1’s”, and the variables are scaled.
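For context, a setup like the one described could look roughly like the sketch below. It is not the code that produced the output above: the data-generating step, the seed, the number of rows, and the weights are all assumptions, and `classwt` is only one way to put extra weight on the “1’s” (`cutoff` or a stratified `sampsize` are alternatives in `randomForest()`).

```r
library(randomForest)

set.seed(42)                       # arbitrary seed, only for reproducibility
n <- 410                           # assumed number of rows
A <- rnorm(n)

# Hypothetical fake data: occupancy depends on A, D mirrors A, B and C are noise.
sim <- data.frame(
  occ = factor(rbinom(n, 1, plogis(-2.5 + 2 * A)), levels = c(0, 1)),
  A = A,
  B = rnorm(n),
  C = rnorm(n),
  D = -A
)

# Scale the predictors (centre to mean 0, standard deviation 1).
preds <- c("A", "B", "C", "D")
sim[preds] <- lapply(sim[preds], function(x) as.numeric(scale(x)))

# classwt follows the order of levels(sim$occ), i.e. "0" then "1".
# The 1:5 weighting is an assumed value, not the one used in the question.
rf <- randomForest(
  occ ~ A + B + C + D,
  data = sim,
  ntree = 500,
  classwt = c(1, 5),
  importance = TRUE                # needed to get MeanDecreaseAccuracy
)

print(rf)                          # OOB error rate and confusion matrix
print(importance(rf))              # per-class and overall importance
```

With `importance = TRUE`, the importance matrix has one column per class plus `MeanDecreaseAccuracy` and `MeanDecreaseGini`, which matches the layout of the tables in this question.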

I understand that this output tells me that A and D are important variables because they have high MeanDecreaseAccuracy values. However, D is the inverse of A (they are perfectly correlated) so why does D have a higher MeanDecreaseAccuracy value?

Moreover, when I run the randomForest with only A and D as variables, these values change while the confusion matrix stays the same:

         0        1 MeanDecreaseAccuracy MeanDecreaseGini
A 28.79540 29.77911             29.00879         23.58469
D 29.75068 30.79498             29.97520         24.53415

        OOB estimate of  error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000

When I run the model with only one good predictor (A or D), or with one good and one bad predictor (AB or CD), the confusion matrix stays the same but the MeanDecreaseAccuracy values of my predictors change. Why do these values change, and how should I approach the selection of my variables? (I am a beginner in occupancy modeling.)

Thanks a lot!

Fleur
  • This is a well-put question @Fleur, but you may get better answers at [Cross Validated](https://stats.stackexchange.com/) which is a statistics-focused forum, whereas SO is for programming _per se_. Good luck! – SDS0 Jan 23 '20 at 11:17
  • Hey @Fleur, the core of your question revolves around how random forest works. It is an ensemble of many decision trees, and at every split only a random subset of the variables is considered. So if two variables are correlated, either one can be picked depending on the subset, and on the whole they will have very close importance values – StupidWolf Jan 23 '20 at 11:50
  • Because of the sampling, if you rerun your randomForest, you will see that the MeanDecreaseAccuracy etc. will change... So beware of interpreting the MeanDecreaseAccuracy from a single run (n=1) – StupidWolf Jan 23 '20 at 11:51
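
The run-to-run variation mentioned in the comments can be checked by refitting the same forest under several seeds and looking at the spread of MeanDecreaseAccuracy. The following is a minimal sketch on hypothetical toy data (the data frame `toy`, the seeds, and the number of repeats are assumptions, not taken from the question):

```r
library(randomForest)

# Hypothetical toy data: A is informative, D = -A, B and C are pure noise.
set.seed(1)
n <- 410
A <- rnorm(n)
toy <- data.frame(
  occ = factor(rbinom(n, 1, plogis(-2 + 2 * A))),
  A = A,
  B = rnorm(n),
  C = rnorm(n),
  D = -A
)

# Fit the same model under 20 different seeds and keep the
# permutation importance (type = 1, i.e. MeanDecreaseAccuracy) each time.
seeds <- 1:20
mda <- sapply(seeds, function(s) {
  set.seed(s)
  fit <- randomForest(occ ~ ., data = toy, ntree = 500, importance = TRUE)
  importance(fit, type = 1)[, 1]
})

# mda has one row per variable and one column per seed.
summary_mda <- data.frame(
  mean = rowMeans(mda),
  sd   = apply(mda, 1, sd)
)
print(round(summary_mda, 2))
```

If the difference between the mean values for A and D is small compared with their standard deviations across seeds, a gap between them in any single run is probably just sampling noise rather than a real difference in importance.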
