I would like to compare, using tidymodels and cross-validation, three linear regression models specified as follows:
- model_A: y ~ a
- model_B: y ~ b
- model_AB: y ~ a + b
In the following, y denotes the target variable, while a and b denote the independent variables.
Without cross-validation it is (I hope) quite clear what I have to do:
- Split my data into training and test sets
set.seed(1234)
split <- data %>% initial_split(strata = y)
data_train <- training(split)
data_test <- testing(split)
- I can specify, fit, and evaluate my model in one go (for example for model_AB)
linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ a + b, data = data_train) %>%
  augment(new_data = data_test) %>%
  rmse(truth = y, estimate = .pred)
The output looks something like this:
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       x.xxx
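As an aside, I believe the same train/test evaluation can be written more compactly with tune::last_fit(), which fits on the training portion of the split and computes metrics on the test portion. A minimal sketch (as far as I know, rmse and rsq are the default regression metrics):
# fit on training(split), evaluate on testing(split) in one step
linear_reg() %>%
  set_engine("lm") %>%
  last_fit(y ~ a + b, split = split) %>%
  collect_metrics()  # rmse and rsq by default (as far as I know)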
I can repeat step 2 for the other two models and compare the three based on RMSE (the metric chosen for this example).
For example, I can create a dummy dataset and run the steps described above:
library(tidyverse)
library(tidymodels)

set.seed(1234)
n <- 1e4
data <- tibble(a = rnorm(n),
               b = rnorm(n),
               y = 1 + 3*a - 2*b + rnorm(n))

set.seed(1234)
split <- data %>% initial_split(strata = y)
data_train <- training(split)
data_test <- testing(split)
- model_A
linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ a, data = data_train) %>%
  augment(new_data = data_test) %>%
  rmse(truth = y, estimate = .pred)
The result:
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        2.23
- model_B
linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ b, data = data_train) %>%
  augment(new_data = data_test) %>%
  rmse(truth = y, estimate = .pred)
The result:
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        3.17
- model_AB
linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ a + b, data = data_train) %>%
  augment(new_data = data_test) %>%
  rmse(truth = y, estimate = .pred)
The result:
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        1.00
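Rather than writing three nearly identical pipelines, the fits above could be collapsed into a single iteration over the formulas. A minimal sketch, reusing the data from above; the helper fit_eval and the use of purrr::imap_dfr are my own choices, not a tidymodels convention:
# named list of candidate formulas
formulas <- list(model_A  = y ~ a,
                 model_B  = y ~ b,
                 model_AB = y ~ a + b)

# fit one formula on the training set and compute the test-set RMSE
fit_eval <- function(formula, name) {
  linear_reg() %>%
    set_engine("lm") %>%
    fit(formula, data = data_train) %>%
    augment(new_data = data_test) %>%
    rmse(truth = y, estimate = .pred) %>%
    mutate(model = name)
}

# one RMSE row per model, labeled by the list names
purrr::imap_dfr(formulas, fit_eval)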
My question is: how can I evaluate RMSE after performing cross-validation on three models that differ in their set of predictors?
In this video, Julia Silge does the job with three different model types (logistic regression, knn, and decision trees) that all share the same set of predictors. What I aim to do, however, is compare models that differ in their predictors.
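For context, here is roughly how far I can get on my own. For a single model and a fixed formula, I believe the resampled version looks something like this (a sketch, assuming 10-fold CV; the object name folds is mine):
# resample the training data
set.seed(1234)
folds <- vfold_cv(data_train, v = 10, strata = y)

# fit y ~ a + b on each fold and average the RMSE across folds
linear_reg() %>%
  set_engine("lm") %>%
  fit_resamples(y ~ a + b, resamples = folds, metrics = metric_set(rmse)) %>%
  collect_metrics()
What I am missing is the idiomatic way to run this over the three formulas at once and compare the resulting RMSE estimates; I suspect workflow_set() might be relevant here, but I have not found an example where the workflows differ only in their predictor sets.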
Any suggestions and/or references?