I am confused by the behaviour of the varimp() function from the party package.

I am using conditional random forests to get variable importance, following the recommendations of Strobl et al. 2009.

It works just fine for all my datasets but one, for which I have to subset my observations. Although the conditional random forest runs normally on the full dataset, varimp() returns only zeros for the subset... It does not seem to run at all, yet no error is generated.

I wondered whether there were too many predictors for so few observations, so I tried with a restricted number of predictors, but it gives the same result. It does not seem to be linked to variable type either, as was pointed out in other cases...

I am obviously missing something, but I just cannot figure out what...

If someone has any insight into the direction I should be looking in, I would be very grateful.

My data here.

VarforCRF <- read.csv("Data.csv",sep=";",dec=",",row.names=1)
library(party)
set.seed(round(runif(1,0,1)*10000))

# Runs fine with the entire dataset
cRF <- cforest(Syrph_pred ~ ., data = VarforCRF, control = cforest_unbiased(ntree = 100))
varimp(object = cRF, conditional = TRUE)

# Keep only the observations from the West sector
CRF_West <- subset(VarforCRF, Sector == "West")

# Does not seem to run at all on the subset and returns only zeros
cRF_W <- cforest(Syrph_pred ~ ., data = CRF_West, control = cforest_unbiased(ntree = 100))
varimp(object = cRF_W, conditional = TRUE)
  • Your data downloads in a weird format. Did you save it as semi-colon separated, because that's not a great method for storing data. – Jason Apr 21 '16 at 12:26
  • Yes. Sorry, the French version of Excel saves .csv files with semicolons by default – Ariane Chabert Apr 21 '16 at 15:06
  • Because Excel behaves differently on different systems, R provides `read.csv()` (for the original comma-separated format) and `read.csv2()` (for the semicolon-separated format). In your case simply use: `read.csv2("Data.csv", row.names = 1)`. – Achim Zeileis Apr 24 '16 at 12:24
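
As a small illustration of the comment above, the following base-R sketch writes a tiny semicolon-separated file with decimal commas and reads it both ways. The file contents and column names are made up for the example; they are not the asker's Data.csv.

```r
# Write a tiny semicolon-separated file with decimal commas
# (hypothetical content, not the asker's data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;x;Sector", "a;1,5;West", "b;2,25;East"), tmp)

# Explicit separators, as in the question ...
d1 <- read.csv(tmp, sep = ";", dec = ",", row.names = 1)

# ... and the shorthand whose defaults are sep = ";" and dec = ","
d2 <- read.csv2(tmp, row.names = 1)

# Both calls should produce the same data frame, with x read as numeric.
print(identical(d1, d2))
str(d2)
```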

1 Answer

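Your subsample CRF_West is too small to yield any splits in the trees of the forest. The data has 23 observations, from which bootstrap samples of about 2/3 (roughly 15 observations) are drawn for each tree. However, by default a node must contain at least 20 observations to be considered for splitting, and each resulting node must contain at least 7 observations; see ?ctree_control. No tree can therefore make even its first split, which is why all variable importances come out as zero.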

To force the trees/forest to split, you can use smaller values, e.g.

cRF_W <- cforest(Syrph_pred~.,data = CRF_West,
  control = cforest_unbiased(ntree=100, minsplit = 15, minbucket = 5))

For this forest you will get non-zero variable importances. Whether or not this will lead to particularly good/reliable results on such a small sample is a different question, though.
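
The size argument can be checked with simple arithmetic. The sketch below uses base R only and does not call party. The 2/3 sampling fraction is the approximate figure quoted above rather than a value read from the package, the variable names are made up for illustration, and the exact rounding done inside the package may differ.

```r
# Back-of-the-envelope check: can a tree make its first split?
n_obs       <- 23     # observations in CRF_West (figure from the answer)
sample_frac <- 2 / 3  # approximate per-tree sampling fraction (assumption)
per_tree    <- n_obs * sample_frac  # about 15 observations per tree

minsplit_default <- 20  # default minimum node size for attempting a split
minsplit_reduced <- 15  # reduced value used in the call above

# A root node can only be split if it holds at least minsplit observations.
can_split_default <- per_tree >= minsplit_default
can_split_reduced <- per_tree >= minsplit_reduced

print(round(per_tree, 1))
print(c(default = can_split_default, reduced = can_split_reduced))
```

With the default threshold the root node is never large enough to split, whereas the reduced threshold is just met, which also shows how little room there is on a sample this small.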

A final comment: Trying to make computations reproducible by setting a seed is good and very useful. However, using a (nonreproducible) random number for the random seed undermines the whole thing...
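
A minimal base-R sketch of the point about seeds follows. The seed values are arbitrary, and runif() merely stands in for any random computation such as fitting a forest. If the aim is to see how results vary across seeds, one option is to loop over a fixed, recorded vector of seeds instead of drawing the seed itself at random.

```r
# A fixed seed makes random draws repeatable.
set.seed(1234)
first_run <- runif(3)

set.seed(1234)
second_run <- runif(3)

same_with_fixed_seed <- identical(first_run, second_run)

# To compare several seeds, keep them in a fixed vector so that
# every run can be reproduced later.
seeds <- c(1, 2, 3)
draws <- sapply(seeds, function(s) {
  set.seed(s)
  runif(1)  # stand-in for e.g. a forest fit followed by varimp()
})

print(same_with_fixed_seed)
print(draws)
```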

Achim Zeileis
  • Thank you so much! I had figured out that the sample size was too small, but I could not understand why. I am trying to find another way to deal with this data without subsetting. The thing is, Syrph_pred has a bimodal distribution that corresponds to the sectors (west and east)... As for the seed, I know. I set the seed randomly just to test whether different seeds lead to different results. A teacher once advised me to do that... – Ariane Chabert Apr 25 '16 at 09:12