I am trying to obtain a prediction grid using the modelr::data_grid()
function that has all levels for one factor of a model and only the "typical" level for the other factor in a model. The wrinkle is that the call levels()
on the factor held constant in the prediction grid should return all levels present in the original data. Otherwise, some prediction functions can fail and return the infamous
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Here is a demonstration of what I can get out of modelr::data_grid()
, as well as what I would desire.
library(tidyverse)
co2_mod <- slice(CO2, -1, -43)
fit <- lm(uptake ~ Type + Treatment, data = co2_mod)
grid_df <- modelr::data_grid(co2_mod, Type, .model = fit) %>%
mutate(Treatment = factor(Treatment))
grid_df
#> # A tibble: 2 x 2
#> Type Treatment
#> <fct> <fct>
#> 1 Quebec chilled
#> 2 Mississippi chilled
levels(grid_df$Treatment)
#> [1] "chilled"
I would like to have
levels(grid_df$Treatment)
[1] "nonchilled" "chilled"
The factor variable with only 1 level in the dataset should "know" all of the possible factors in the original dataset. For example, see below the screen shot from the rstanarm
vignettes that show the need to have both Male
and Female
as levels, even though predictions are made holding gender
equal to Female
.
I have tried to write a couple of functions to accomplish my goal. There is an extra step of repeating the dataset 1000 times, since I am using this grid as newdata
for posterior predictive distributions from an rstanarm
model fit.
Auxiliary function (this works):
get_all_levels <- function(v, dataset){
v <- enquo(v)
a <- select(dataset, !!v) %>% as_vector()
l <- levels(a)
return(l)
}
Main function, which calls auxiliary function (this doesn't work):
grid_all_levels <- function(model, variable, df){
variable <- enquo(variable)
grid <- expand_grid(
rep = seq(1, 1000, 1),
modelr::data_grid(df, !!variable, .model = model)
) %>%
mutate(across(where(is.character), ~factor(., levels = get_all_levels(., df))))
return(grid)
}
Here is a small reprex that you can run these functions on.
library(tidyverse)
get_all_levels <- function(v, dataset){
v <- enquo(v)
d <- as_tibble(dataset)
a <- select(d, !!v) %>% as_vector()
l <- levels(a)
return(l)
}
grid_all_levels <- function(model, variable, df){
variable <- enquo(variable)
grid <- expand_grid(
rep = seq(1, 1000, 1),
modelr::data_grid(df, !!variable, .model = model)
) %>%
mutate(across(where(is.character), ~factor(., levels = get_all_levels(., df))))
return(grid)
}
co2_mod <- slice(CO2, -1, -43)
fit <- lm(conc ~ Type + Treatment, data = co2_mod)
get_all_levels(Type, co2_mod)
#> [1] "Quebec" "Mississippi"
grid_all_levels(fit, Type, co2_mod)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(.)` instead of `.` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> Error: Problem with `mutate()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Columns `chilled`, `chilled`, `chilled`, `chilled`, `chilled`, etc. don't exist.
#> ℹ Input `..1` is `across(...)`.
I'm pretty sure that the issue is how I am using the .
in the get_all_levels
call in the grid_all_levels
function.
How can I get grid_all_levels()
to work?
Update
I modified the grid_all_levels()
function per answers below to look like this:
grid_all_levels <- function(model, variable, df){
variable <- enquo(variable)
grid <- expand_grid(
rep = seq(1, 1000, 1),
modelr::data_grid(df, !!variable, .model = model)
) %>%
mutate(across(where(is.character), ~factor(., levels = get_all_levels(cur_column(), df))))
return(grid)
}
which returns
#> # A tibble: 2,000 x 3
#> rep Type Treatment
#> <dbl> <fct> <fct>
#> 1 1 Quebec chilled
#> 2 1 Mississippi chilled
#> 3 2 Quebec chilled
#> 4 2 Mississippi chilled
#> 5 3 Quebec chilled
#> 6 3 Mississippi chilled
#> 7 4 Quebec chilled
#> 8 4 Mississippi chilled
#> 9 5 Quebec chilled
#> 10 5 Mississippi chilled
#> # … with 1,990 more rows
levels(a$Treatment)
#> [1] "nonchilled" "chilled"
Which was exactly what I wanted!