
Apologies in advance for no data samples:

I built a random forest of 128 trees, with no tuning, using one binary outcome and four continuous explanatory variables. I then compared this forest's AUC against that of an already-built forest when predicting on the same cases. What I want to figure out is what exactly is lending predictive power to this new forest. Univariate analysis with the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
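
For concreteness, here is a minimal sketch of the setup (`df`, `y`, and `x1`..`x4` are placeholder names, and `pROC` is just one way to compute AUC):

```r
# Placeholder data frame `df` with a binary factor outcome `y` (levels "0"/"1")
# and four continuous predictors x1..x4.
library(randomForest)
library(pROC)

set.seed(42)
rf <- randomForest(y ~ x1 + x2 + x3 + x4, data = df, ntree = 128)

# AUC based on out-of-bag class probabilities for the positive class
auc(roc(df$y, predict(rf, type = "prob")[, "1"]))
```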

EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.

Jonathan Rauscher
  • Just to clarify, are you looking to find variables or model hyperparameters that improve model accuracy? – Will Aug 28 '17 at 14:04
  • @WilliamAshford He mentioned no tuning, so I would guess he meant how to find the most predictive variables? – acylam Aug 28 '17 at 14:33
  • How many variables are we talking? Could be asking for the importance of hyperparameters without tuning for the general case though. – Will Aug 28 '17 at 14:41
  • @WilliamAshford as stated in the OP, we're working with 4 explanatory continuous variables. – Jonathan Rauscher Aug 28 '17 at 14:44
  • @WilliamAshford As I understand it, random forest doesn't need much hyperparameter tuning other than adjusting the number of trees, so variable importance might be more relevant in this case? OP please reiterate what you're trying to ask... – acylam Aug 28 '17 at 14:51
  • @useR Question updated for clarification. – Jonathan Rauscher Aug 28 '17 at 14:54

1 Answer


Random Forest is what's known as a "black box" learning algorithm, because there is no straightforward way to interpret the relationship between the input and outcome variables. You can, however, use something like a variable importance plot or a partial dependence plot to get a sense of which variables contribute most to the predictions.

Here are some discussions on variable importance plots, also here and here. It is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of the plot depends on the metric used to assess variable importance. For example, with MeanDecreaseAccuracy, a high value for a variable means that randomly permuting that variable's values in the out-of-bag samples substantially reduces the model's classification accuracy, i.e. the model relies heavily on it.
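
As a minimal sketch (assuming a data frame `df` with a binary factor outcome `y` and placeholder predictor names `x1`..`x4`):

```r
library(randomForest)

set.seed(42)
# importance = TRUE is needed to compute permutation-based
# MeanDecreaseAccuracy in addition to MeanDecreaseGini
rf <- randomForest(y ~ x1 + x2 + x3 + x4, data = df,
                   ntree = 128, importance = TRUE)

varImpPlot(rf)            # dotchart of both importance measures
importance(rf, type = 1)  # raw MeanDecreaseAccuracy values
```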

Here are some other discussions on partial dependence plots for predictive models, also here. It is implemented in the randomForest package as partialPlot().
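
Continuing the hedged sketch above (same placeholder names; "1" is assumed to be the positive class label):

```r
# Marginal effect of x1 on the (logit of the) predicted probability
# of the positive class, averaging over the other predictors
partialPlot(rf, pred.data = df, x.var = "x1", which.class = "1")
```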

In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with an L2 regularization) for a more interpretable model and compare its performance against the random forest. See this discussion about variable selection. It is implemented in the glmnet package. Basically, L2 regularization, also known as ridge, adds a penalty term to your loss function that shrinks your coefficients to reduce variance, at the expense of increased bias. This effectively reduces prediction error when the reduction in variance more than compensates for the added bias (which is often the case). Since you only have 4 input variables, I suggested L2 instead of L1 (also known as lasso, which also performs automatic feature selection). See this answer for tuning the shrinkage parameter of ridge and lasso with cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
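
A hedged glmnet sketch, with the same placeholder names as above (`y` must be a binary factor or 0/1 vector):

```r
library(glmnet)

X <- as.matrix(df[, c("x1", "x2", "x3", "x4")])

set.seed(42)
# alpha = 0 gives ridge (L2); alpha = 1 would give lasso (L1).
# cv.glmnet selects the shrinkage parameter lambda by cross-validation.
cv_fit <- cv.glmnet(X, df$y, family = "binomial", alpha = 0)

coef(cv_fit, s = "lambda.min")  # coefficients at the best lambda
pred <- predict(cv_fit, newx = X, s = "lambda.min", type = "response")
```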

acylam
  • Thanks for the response. I'll attempt your suggestion of binary logistic regression and report the outcome. By the way, what would the parameter be for L2 regularization in `lrm` from the `rms` library? Or would this be something done mathematically? – Jonathan Rauscher Aug 28 '17 at 15:52
  • @JonathanRauscher I'm not familiar with the `rms` package, but see my updated answer for more details. – acylam Aug 28 '17 at 16:06
  • I tried your suggestion and the logistic regression was much weaker than the random forest. Thoughts? – Jonathan Rauscher Aug 28 '17 at 18:20
  • @JonathanRauscher In that case, you may just want to use the logistic regression for interpretation and the random forest for prediction, and use the plots mentioned above to assess variable importance. Ultimately, it depends on whether you are more concerned about interpretability or prediction. Random forest is no doubt better at prediction, but logistic regression is easier to interpret. – acylam Aug 28 '17 at 18:44