I have a question about the “randomForest” package in R. I am trying to build a model with the ecological variables that best explain my species occupancy data for 41 field sites (gathered from camera traps). My ultimate goal is species occupancy modeling with the “unmarked” package, but before I get to that stage I need to select the variables that best explain occupancy, since I have many. To get familiar with randomForest, I generated a fake occupancy dataset and a fake variable dataset (with variables A and D being good predictors of occupancy and B and C being bad predictors). When I run randomForest, my output looks like this:
           0        1 MeanDecreaseAccuracy MeanDecreaseGini
A 25.3537667 27.75533           26.9634018       20.6505920
B  0.9567857  0.00000            0.9665287        0.0728273
C  0.4261638  0.00000            0.4242409        0.1411643
D 32.1889374 35.52439           34.0485837       27.0691574
OOB estimate of error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000
I did not create separate training and test sets. I put extra weight on correctly predicting the “1’s”, and the variables are scaled.
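A minimal sketch of this kind of setup might look like the following. It is not the exact script behind the output above. The seed, the sample size of 410 (the row total of the confusion matrix), the way occupancy is simulated, the choice D = -A, and the use of `classwt` to apply the extra weight are all assumptions made for illustration.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 410     # assumption: row total of the confusion matrix above

# Hypothetical fake predictors: A drives occupancy, D is its exact negative,
# B and C are pure noise.
A <- rnorm(n)
vars <- data.frame(A = A, B = rnorm(n), C = rnorm(n), D = -A)
vars <- as.data.frame(scale(vars))  # centre and scale every column

# Hypothetical fake occupancy: the probability of a 1 increases with A.
occ <- factor(rbinom(n, size = 1, prob = plogis(-3 + 2 * vars$A)),
              levels = c(0, 1))

# Assumption: the extra weight on the 1s is given through classwt,
# in the order of levels(occ), i.e. class "0" first, then class "1".
rf <- randomForest(x = vars, y = occ, ntree = 500, importance = TRUE,
                   classwt = c(1, 10))

print(rf)        # OOB error rate and confusion matrix
importance(rf)   # per-class, MeanDecreaseAccuracy and MeanDecreaseGini columns
```

Because no separate test set is used, the error rate and confusion matrix printed here are out-of-bag estimates.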
I understand that this output tells me that A and D are important variables because they have high MeanDecreaseAccuracy values. However, D is the inverse of A (they are perfectly correlated) so why does D have a higher MeanDecreaseAccuracy value?
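Since a forest is built from bootstrap samples and random variable choices at each split, one thing that could be checked is whether the gap between A and D is stable or just varies from run to run. A rough sketch of such a check follows, again on hypothetical simulated data where D = -A is assumed. The seeds, the sample size, and the simulation itself are illustrative and not taken from the real script.

```r
library(randomForest)

set.seed(42)  # arbitrary seed for the simulated data
n <- 410      # assumption: same size as the confusion matrix above
A <- rnorm(n)
x <- data.frame(A = A, D = -A)  # assumption: D is the exact negative of A
y <- factor(rbinom(n, 1, plogis(-3 + 2 * A)), levels = c(0, 1))

# Refit the same model under different seeds and record the
# permutation importance of each variable every time.
seeds <- 1:20
mda <- t(sapply(seeds, function(s) {
  set.seed(s)
  rf <- randomForest(x = x, y = y, ntree = 500, importance = TRUE)
  importance(rf, type = 1)[, 1]  # type = 1 selects MeanDecreaseAccuracy
}))

# One row per seed, one column per variable.
head(mda)

# Distribution of the A minus D gap across seeds.
summary(mda[, "A"] - mda[, "D"])
```

If the sign of the difference flips between seeds, the ordering of A and D in a single run would not say much on its own.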
Moreover, when I run the randomForest with only A and D as variables, these values change while the confusion matrix stays the same:
         0        1 MeanDecreaseAccuracy MeanDecreaseGini
A 28.79540 29.77911             29.00879         23.58469
D 29.75068 30.79498             29.97520         24.53415
OOB estimate of error rate: 29.02%
Confusion matrix:
    0   1 class.error
0 250 119   0.3224932
1   0  41   0.0000000
When I run the model with only one good predictor (A or D), or with one good and one bad predictor (AB or CD), the confusion matrix stays the same but the MeanDecreaseAccuracy values of my predictors change. Why do these values change, and how should I approach the selection of my variables? (I am a beginner in occupancy modeling.)
Thanks a lot!