
I am using tidymodels to fit multiple Random Forest models. I then followed along with this tutorial to compare the model results. The problem is that I get the error:

Error in UseMethod("anova") : 
  no applicable method for 'anova' applied to an object of class "ranger"

As an example:

library(tidymodels)

set.seed(123)
iris <- iris %>% mutate(
  is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
  mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))

iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

rec_normal <- recipe(is_versicolor ~ Petal.Width + Species, data = iris_train)
rec_interaction <- rec_normal %>% 
  step_interact(~ Petal.Width:starts_with("Species"))

iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")

# normal workflow
iris_wf <- workflow() %>% 
  add_model(iris_model) %>% 
  add_recipe(rec_normal)

# interaction workflow
iris_wf_interaction <- iris_wf %>% 
  update_recipe(rec_interaction)

# fit models
iris_normal_lf <- last_fit(iris_wf, split = iris_split)
iris_inter_lf <- last_fit(iris_wf_interaction, split = iris_split)

normalmodel <- iris_normal_lf %>% extract_fit_engine()
intermodel  <- iris_inter_lf %>% extract_fit_engine()

anova(normalmodel, intermodel) %>% tidy()

How can I run an ANOVA or ANOVA-type comparison of these models, to see if one is significantly better?

Adam_G

3 Answers


Just using your code, and adapting Julia Silge's blog on workflowsets:

Predict #TidyTuesday giant pumpkin weights with workflowsets

As ANOVA is not available for ranger, generate folds to resample instead:

set.seed(234)
iris_folds <- vfold_cv(iris_train)
iris_folds

Combine your recipes into a workflowset:

iris_set <-
  workflow_set(
    list(rec_normal, rec_interaction),
    list(iris_model),
    cross = TRUE
  )

iris_set

Set up parallel processing:

doParallel::registerDoParallel()
set.seed(2021)

Fit using the folds:

iris_rs <-
  workflow_map(
    iris_set,
    "fit_resamples",
    resamples = iris_folds
  )

autoplot(iris_rs)

This chart would usually address your question of how to compare models.

As "species" is on the righthand side of both recipe formulas, and the response "is_versicolor" is calculated from species, the models are completely accurate.

Finish off the output:

collect_metrics(iris_rs)

final_fit <-
  extract_workflow(iris_rs, "recipe_2_rand_forest") %>%
  fit(iris_train)

There is no tidier for ranger models.
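
Since tidy() is unavailable, one option is to inspect the underlying ranger fit directly. A minimal sketch using the final_fit workflow from above (prediction.error is the out-of-bag prediction error that ranger stores on its fit object):

fitted_ranger <- final_fit %>% extract_fit_engine()

fitted_ranger                    # print the ranger fit summary
fitted_ranger$prediction.error   # out-of-bag prediction error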

In your code, if you change to:

rec_normal <- recipe(is_versicolor ~ Sepal.Length + Sepal.Width, data = iris_train)
rec_interaction <- recipe(is_versicolor ~ Petal.Width + Petal.Length, data = iris_train)

you can have some fun!

Hope this helps Adam. Just learning the wonderful Tidymodels like you, and look forward to comments. :-)

Isaiah
  • Thanks! This is helpful. So how do I convert those to something like a p-value? – Adam_G Dec 12 '22 at 22:27
  • Using something like performance metrics doesn't give you a p-value; instead it gives you something like a mean and variance on the metric, to compare between two model configurations. If you want something more like an ANOVA, you could try using `lme4::lmer()` on your resamples. Check out the [internals of `tune_race_anova()`](https://github.com/tidymodels/finetune/blob/main/R/racing_helpers.R#L516) to see a possible approach. – Julia Silge Dec 12 '22 at 22:56
  • Restrictive, but you could set your engine to GLM: `iris_model <- logistic_reg() |> set_engine("glm") |> set_mode("classification")` – Isaiah Dec 13 '22 at 01:33
  • Thanks @JuliaSilge! (btw, loved your NormConf talk; trying to use as inspiration here :-) – Adam_G Dec 18 '22 at 17:33
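
Following up on the lme4::lmer() suggestion in the comments, here is a minimal sketch of that kind of resample-level comparison (assumptions: lme4 is installed, and collect_metrics(summarize = FALSE) returns one accuracy per fold per workflow; with the original recipes both workflows score 100%, so this only becomes informative with recipes that are not perfectly accurate):

library(lme4)

per_fold_acc <- iris_rs %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "accuracy")

# workflow as a fixed effect, resampling fold as a random effect,
# broadly along the lines of what tune_race_anova() does internally
fold_mod <- lmer(.estimate ~ wflow_id + (1 | id), data = per_fold_acc)
summary(fold_mod)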

You could compare your random forest models by comparing their accuracies using the aov() function. First, collect the accuracies with collect_metrics(), save them in a data frame, and run a model with aov() to get the results. Here is a reproducible example:

library(tidymodels)
set.seed(123)
iris <- iris %>% mutate(
  is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
  mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))

iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

rec_normal <- recipe(is_versicolor ~ Petal.Width + Species, data = iris_train)
rec_interaction <- rec_normal %>% 
  step_interact(~ Petal.Width:starts_with("Species"))

iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")

# normal workflow
iris_wf <- workflow() %>% 
  add_model(iris_model) %>% 
  add_recipe(rec_normal)

# interaction workflow
iris_wf_interaction <- iris_wf %>% 
  update_recipe(rec_interaction)

# fit models
iris_normal_lf <- last_fit(iris_wf, split = iris_split)
iris_inter_lf <- last_fit(iris_wf_interaction, split = iris_split)
#> ! train/test split: preprocessor 1/1: Categorical variables used in `step_interact` should probably be avoided...

normalmodel <- iris_normal_lf %>% extract_fit_engine()
intermodel  <- iris_inter_lf %>% extract_fit_engine()

# Check confusion matrix
iris_normal_lf %>%
  collect_predictions() %>% 
  conf_mat(is_versicolor, .pred_class) 
#>                 Truth
#> Prediction       versicolor not_versicolor
#>   versicolor             10              0
#>   not_versicolor          0             20

iris_inter_lf %>%
  collect_predictions() %>% 
  conf_mat(is_versicolor, .pred_class) 
#>                 Truth
#> Prediction       versicolor not_versicolor
#>   versicolor             10              0
#>   not_versicolor          0             20

# Extract accuracy of models and create dataset
acc_normalmodel <- iris_normal_lf %>% collect_metrics() %>% select(.estimate) %>% slice(1)
acc_intermodel <- iris_inter_lf %>% collect_metrics() %>% select(.estimate) %>% slice(1)
results = data.frame(model = c("normalmodel", "intermodel"),
                     accuracy = c(acc_normalmodel$.estimate, acc_intermodel$.estimate))

# perform ANOVA on the classification accuracy
aov_results <- aov(accuracy ~ model, data = results)
summary(aov_results)
#>             Df   Sum Sq  Mean Sq
#> model        1 4.93e-32 4.93e-32

Created on 2022-12-15 with reprex v2.0.2

As you can see, the results don't show a p-value, because the degrees of freedom are too low: with one accuracy per model there are only two observations and two groups, leaving zero residual degrees of freedom, so no F test can be computed (why do I not get a p-value from this anova in r).


You could also use aov() on the predictions of both models and compare their performance. Here is a reproducible example:

# Get predictions of both models for not_versicolor
normalmodel_pred <- as.data.frame(normalmodel$predictions)$not_versicolor
intermodel_pred  <- as.data.frame(intermodel$predictions)$not_versicolor

summary(aov(normalmodel_pred ~ intermodel_pred))
#>                  Df Sum Sq Mean Sq F value Pr(>F)    
#> intermodel_pred   1 25.032  25.032    9392 <2e-16 ***
#> Residuals       118  0.314   0.003                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Created on 2022-12-17 with reprex v2.0.2

As you can see, the p-value is less than 0.05, which suggests that the predictions of the two models differ, which is right if you look at the predicted probabilities of the two models.
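
To see those probability differences directly, you can put the two prediction vectors created above side by side:

head(data.frame(normal = normalmodel_pred, interaction = intermodel_pred))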



Quinten
  • Is there any way to derive significance from this? – Adam_G Dec 15 '22 at 23:30
  • Hi @Adam_G, yes sure! The thing is that with only two models you only have two accuracy values. Which is too low in degrees of freedom for anova. An option could be compare the means of each tree in the random forest. – Quinten Dec 16 '22 at 10:24
  • Can you show how to do that then? Unfortunately without that, this doesn't really answer the question. – Adam_G Dec 16 '22 at 19:34
  • @Adam_G, I added some code to perform aov on the predictions of both models, which returns a significance. – Quinten Dec 17 '22 at 09:48
  • The model is a classification model, so the raw predictions get converted to not_versicolor or versicolor. The classification accuracy of the models is identical, both being 100%. You can see this with `round(normalmodel[["predictions"]][,1])-intermodel[["predictions"]][,1] ` which is all zeros. As a classification model, there is no difference between the two models. – Isaiah Dec 17 '22 at 13:56
  • Hi @Isaiah, that is true, but the probability of a classification per model could be different right? – Quinten Dec 17 '22 at 14:17
  • `rec_normal <- recipe(is_versicolor ~ Sepal.Width, data = iris_train) rec_alternative <- recipe(is_versicolor ~ Sepal.Length, data = iris_train)` is a more useful pair of models to work with, as the classification accuracy is not 100% – Isaiah Dec 17 '22 at 14:23
  • Hi @Quinten! Good point. Would depend on the data set too. I had a go at using resamples and extracting the accuracies, then doing an ANOVA, which extends your original idea of ANOVA on the accuracies. Very interested in your comments. – Isaiah Dec 17 '22 at 14:40

This uses a different model pair and compares the models on classification accuracy using resamples; the approach is easily extended to other metrics.

library(dplyr)
library(tibble)
library(ggplot2)
library(tidyr)
library(rsample)
library(recipes)
library(parsnip)
library(workflows)
library(tune)
library(yardstick)
library(workflowsets)
library(infer)

set.seed(123)
iris <- iris %>% mutate(
  is_versicolor = ifelse(Species == "versicolor", "versicolor", "not_versicolor")) %>%
  mutate(is_versicolor = factor(is_versicolor, levels = c("versicolor", "not_versicolor")))

iris_split <- initial_split(iris, strata = is_versicolor, prop = 0.8)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

# replacing normal and interaction recipes with models
# that give less than 100% accuracy.
rec_normal <- recipe(is_versicolor ~ Sepal.Width, data = iris_train)
rec_alternative <- recipe(is_versicolor ~ Sepal.Length, data = iris_train)

iris_model <- rand_forest() %>% set_engine("ranger") %>% set_mode("classification")

# Create folds
set.seed(234)
iris_folds <- vfold_cv(iris_train)
iris_folds

# Combine models into set
iris_set <-
  workflow_set(
    list(rec_normal, rec_alternative),
    list(iris_model),
    cross = TRUE
  )

doParallel::registerDoParallel()
set.seed(2021)

# fit models
iris_rs <-
  workflow_map(
    iris_set,
    "fit_resamples",
    resamples = iris_folds
  )

# Visualise model performance
autoplot(iris_rs)


# Extract resample accuracies
# (column 4 of the workflow set holds each workflow's resampling results;
#  accuracy is the first .estimate in each fold's metric tibble)
model_1_rs <- iris_rs[1,][[4]][[1]]$.metrics
model_2_rs <- iris_rs[2,][[4]][[1]]$.metrics
model_acc <- tibble(model_1 = NA, model_2 = NA)
for (i in 1:10) {  # one iteration per fold from vfold_cv()
  model_acc[i, 1] <- model_1_rs[[i]][[".estimate"]][1]
  model_acc[i, 2] <- model_2_rs[[i]][[".estimate"]][1]
}
model_acc <- model_acc |> pivot_longer(cols = everything(), names_to = "model", values_to = "acc")
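
The same per-fold accuracies can likely be pulled without the positional indexing, using collect_metrics() on the workflow set (a sketch; note the wflow_id values are names like "recipe_1_rand_forest" rather than "model_1"/"model_2", so the downstream code would need those labels recoded):

model_acc_alt <- iris_rs %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "accuracy") %>%
  transmute(model = wflow_id, acc = .estimate)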


# Do ANOVA
aov_results <- aov(acc ~ model, data = model_acc)
summary(aov_results)
ggplot(data = model_acc, aes(x = acc, fill = model)) +
  geom_density(alpha = 0.2) +
  labs(x = "accuracy")

Giving the p values:

> summary(aov_results)
            Df Sum Sq Mean Sq F value Pr(>F)
model        1 0.0281 0.02813   1.378  0.256
Residuals   18 0.3674 0.02041

Looking at the difference in model accuracies through a different lens, using a permutation test from the infer package. First visualise the variation:

model_acc |> ggplot(aes(x = model, y = acc)) +
  geom_boxplot() +
  labs(y = 'accuracy')

Then calculate a test statistic:

observed_statistic <- model_acc %>%
  specify(acc ~ model) %>%
  calculate(stat = "diff in means", order = c("model_1", "model_2"))

observed_statistic

Then do a simulation of the distribution:

null_dist_2_sample <- model_acc %>%
  specify(acc ~ model) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means" ,order = c("model_1", "model_2"))

and plot:

null_dist_2_sample %>%
  visualize() + 
  shade_p_value(observed_statistic,
                direction = "two-sided") +
  labs(x = "test statistic")

And get the p value:

p_value_2_sample <- null_dist_2_sample %>%
  get_p_value(obs_stat = observed_statistic,
              direction = "two-sided")

p_value_2_sample

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.228

This is almost the same as the p value from aov(). Note that, consistent with the accuracies of the two models being close, the p value is high.

Isaiah