I'm trying to figure out the following: I am running Random Forest to classify habitat use and have GPS data from 17 animals. My data frame holds habitat variables such as aspect and canopy cover at each used animal location and each unused, random location. Each used location is also identified by the ID number of the animal (this column is called "lynx"). A column called "usvsa" codes used locations as 1 and unused locations as 0. Here's the top of my spatial points data frame, called sdata3:

lynx usvsa   aspect canopy_cover clearcut_area       cti deciduous dist_draw dist_ridge 
311    1 252.3302      55.3704             0  7.311823         0   90.0000  484.66483            
311    1 263.1394      55.1528             0  6.857203         0  324.4996  305.94116            
311    1 249.6992      72.9272             0  6.612025         0  364.9658  212.13203            
311    1 194.4459      50.4428             0  6.330615         0  108.1665   67.08204     

OK. I'd like to use jackknifing to run Random Forest 17 times (once per individual), leaving one animal out on each run. The idea is to compare the results of the runs to make sure no single animal has a disproportionately large effect on the model results. I've been reading about the "bootstrap" package and its jackknife function: jackknife(x, theta, ...)

I get that I need to write a function for theta, but I can't figure out how to put it all together so that each run of Random Forest leaves one animal out. Here is my Random Forest model:

randomForest(y ~ ., data=sdata3, ntree=b, importance=TRUE, norm.votes=TRUE, proximity=TRUE)

I'd like to compare the importance values and OOB error of each run. Any tips would be appreciated!
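For illustration, leave-one-animal-out refitting can be sketched with a plain loop over the animal IDs instead of jackknife(). This is only a sketch under assumptions not stated in the question: the data are treated as an ordinary data frame (for a spatial points data frame, something like as.data.frame(sdata3) would be needed first), usvsa is taken as the response, lynx is dropped from the predictors, and proximity is left off to save memory. The helper name fit_leaving_out, the ntree values, and the small made-up data set are all hypothetical.

```r
library(randomForest)

# Hypothetical helper: fit a forest on all animals except `id`.
# Assumes `dat` is a plain data frame with columns `lynx` (animal ID)
# and `usvsa` (1 = used, 0 = unused); every other column is a predictor.
fit_leaving_out <- function(dat, id, ntree = 500) {
  train <- dat[dat$lynx != id, setdiff(names(dat), "lynx")]
  train$usvsa <- factor(train$usvsa)  # factor response => classification
  randomForest(usvsa ~ ., data = train, ntree = ntree, importance = TRUE)
}

# Made-up data so the sketch runs on its own; replace with the real data.
set.seed(1)
demo <- data.frame(
  lynx         = rep(c(311, 312, 313), each = 40),
  usvsa        = rep(c(1, 0), times = 60),
  aspect       = runif(120, 0, 360),
  canopy_cover = runif(120, 0, 100)
)

# One forest per animal, each fitted without that animal's locations.
ids  <- unique(demo$lynx)
fits <- lapply(ids, function(id) fit_leaving_out(demo, id, ntree = 100))
names(fits) <- ids

# Final OOB error of each run: last row of err.rate, "OOB" column.
oob <- sapply(fits, function(f) f$err.rate[f$ntree, "OOB"])
print(oob)
```

With the real data, a run whose OOB error differs sharply from the other 16 would point at the animal left out of that run.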

user3088823
  • You do realize that the OOB error is specifically calculated only on trees where that case was _not included_ in the tree construction? So the OOB error is _already_ doing basically what you describe. – joran Jul 21 '14 at 22:11
  • I see what you're saying except that I'm trying to run the model holding one animal back for each run. So, the difference from run to run is that the data from one animal has been left out. I want to see if the overall OOB error (or the importance of the variables) is greatly affected if I leave the data out for any one animal indicating that that animal is selecting very different habitat from the others. – user3088823 Jul 21 '14 at 22:45
  • Never mind, I think I follow now, although I think there's probably a simpler way of doing this than an explicit jackknife... – joran Jul 21 '14 at 22:50
  • I'm still trying to work this out. I think the main problem is that the jackknife function wants to return a mean, whereas I want to explore the results of each Random Forest run (OOB error, prediction error...). I think I'm going to have to write a loop that runs the Random Forest model leaving one animal out each time and saves the results of each run to a CSV file. I've done some reading on loops, but any ideas would be appreciated! – user3088823 Aug 13 '14 at 15:13
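The loop described in the last comment could look roughly like the following. It is a sketch, not a tested solution for the real data: it assumes a plain data frame with usvsa already stored as a factor, uses a small made-up data set in place of sdata3, and writes to a hypothetical file name in a temporary directory. Each run contributes one row per predictor, holding that run's permutation importance and overall OOB error.

```r
library(randomForest)

# Made-up stand-in for the real data; columns mirror the question's layout.
set.seed(42)
demo <- data.frame(
  lynx         = rep(c(311, 312, 313), each = 40),
  usvsa        = factor(rep(c(1, 0), times = 60)),
  aspect       = runif(120, 0, 360),
  canopy_cover = runif(120, 0, 100)
)

rows <- list()
for (id in unique(demo$lynx)) {
  # Drop the held-out animal's rows and the ID column itself.
  train <- demo[demo$lynx != id, names(demo) != "lynx"]
  rf <- randomForest(usvsa ~ ., data = train, ntree = 100, importance = TRUE)

  # type = 1 requests the permutation measure (mean decrease in accuracy).
  imp <- importance(rf, type = 1)

  rows[[as.character(id)]] <- data.frame(
    left_out               = id,
    variable               = rownames(imp),
    mean_decrease_accuracy = unname(imp[, 1]),
    oob_error              = rf$err.rate[rf$ntree, "OOB"]
  )
}

# Stack the per-run tables and save them as one CSV.
results <- do.call(rbind, rows)
rownames(results) <- NULL
out_file <- file.path(tempdir(), "rf_leave_one_out.csv")  # hypothetical name
write.csv(results, out_file, row.names = FALSE)
```

Keeping everything in one long table makes it easy to compare a variable's importance across the runs, for example with aggregate() or a plot grouped by left_out.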

0 Answers