-1

I am trying to obtain a prediction grid using the modelr::data_grid() function that has all levels for one factor of a model and only the "typical" level for the other factor in a model. The wrinkle is that the call levels() on the factor held constant in the prediction grid should return all levels present in the original data. Otherwise, some prediction functions can fail and return the infamous

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Here is a demonstration of what I can get out of modelr::data_grid(), as well as what I would desire.

library(tidyverse)

co2_mod <- slice(CO2, -1, -43)
fit <- lm(uptake ~ Type + Treatment, data = co2_mod)

grid_df <- modelr::data_grid(co2_mod, Type, .model = fit) %>%
  mutate(Treatment = factor(Treatment))
grid_df
#> # A tibble: 2 x 2
#>   Type        Treatment
#>   <fct>       <fct>    
#> 1 Quebec      chilled  
#> 2 Mississippi chilled
levels(grid_df$Treatment)
#> [1] "chilled"

I would like to have

levels(grid_df$Treatment)
[1] "nonchilled" "chilled"

The factor variable with only 1 level in the dataset should "know" all of the possible factors in the original dataset. For example, see below the screen shot from the rstanarm vignettes that show the need to have both Male and Female as levels, even though predictions are made holding gender equal to Female.

Screenshot to rstanarm article


I have tried to write a couple of functions to accomplish my goal. There is an extra step of repeating the dataset 1000 times, since I am using this grid as newdata for posterior predictive distributions from an rstanarm model fit.

Auxiliary function (this works):

get_all_levels <- function(v, dataset){
  v <- enquo(v)
  
  a <- select(dataset, !!v) %>% as_vector()
  l <- levels(a)
  
  return(l)
}

Main function, which calls auxiliary function (this doesn't work):

grid_all_levels <- function(model, variable, df){
  variable <- enquo(variable)
  grid <- expand_grid(
    rep = seq(1, 1000, 1), 
    modelr::data_grid(df, !!variable, .model = model)
  ) %>%
    mutate(across(where(is.character), ~factor(., levels = get_all_levels(., df))))
  
  return(grid)
}

Here is a small reprex that you can run these functions on.

library(tidyverse)
  
get_all_levels <- function(v, dataset){
  v <- enquo(v)
  d <- as_tibble(dataset)
  a <- select(d, !!v) %>% as_vector()
  l <- levels(a)
  
  return(l)
}

grid_all_levels <- function(model, variable, df){
  variable <- enquo(variable)
  grid <- expand_grid(
    rep = seq(1, 1000, 1), 
    modelr::data_grid(df, !!variable, .model = model)
  ) %>%
    mutate(across(where(is.character), ~factor(., levels = get_all_levels(., df))))
  
  return(grid)
}

co2_mod <- slice(CO2, -1, -43)
fit <- lm(conc ~ Type + Treatment, data = co2_mod)
get_all_levels(Type, co2_mod)
#> [1] "Quebec"      "Mississippi"

grid_all_levels(fit, Type, co2_mod)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(.)` instead of `.` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> Error: Problem with `mutate()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Columns `chilled`, `chilled`, `chilled`, `chilled`, `chilled`, etc. don't exist.
#> ℹ Input `..1` is `across(...)`.

I'm pretty sure that the issue is how I am using the . in the get_all_levels call in the grid_all_levels function.

How can I get grid_all_levels() to work?


Update

I modified the grid_all_levels() function per answers below to look like this:

grid_all_levels <- function(model, variable, df){
  variable <- enquo(variable)
  grid <- expand_grid(
    rep = seq(1, 1000, 1), 
    modelr::data_grid(df, !!variable, .model = model)
  ) %>%
    mutate(across(where(is.character), ~factor(., levels = get_all_levels(cur_column(), df))))
  
  return(grid)
}

which returns

#> # A tibble: 2,000 x 3
#>      rep Type        Treatment
#>    <dbl> <fct>       <fct>    
#>  1     1 Quebec      chilled  
#>  2     1 Mississippi chilled  
#>  3     2 Quebec      chilled  
#>  4     2 Mississippi chilled  
#>  5     3 Quebec      chilled  
#>  6     3 Mississippi chilled  
#>  7     4 Quebec      chilled  
#>  8     4 Mississippi chilled  
#>  9     5 Quebec      chilled  
#> 10     5 Mississippi chilled  
#> # … with 1,990 more rows
levels(a$Treatment)
#> [1] "nonchilled" "chilled"

Which was exactly what I wanted!

2 Answers2

1

The problem is that the . inside the across is the actual column not the name of the column. Sadly there is no way to get the names inside an across call however using purrr::reduce we can make a workaround:

library(tidyverse)
levelize <- function(.x, .y){
 mutate(.x, !!paste0(.y) := factor(!!sym(.y), levels =  get_all_levels(!!sym(.y), df) ) )
}

estimate_prevalence_differences_polr <- function(model, variable, df){
  variable <- enquo(variable)
  # putting the names of the factor variables inside a vector
  vars <- sapply(df, function(x) is.factor(x))
  vars <- names(vars)[vars]
  expand_grid(
    rep = 1:1000, 
    modelr::data_grid(df, !!variable, .model = model)
  ) %>% 
### looping through the vars that are factor
## the .x argument is the one set by .init i.e the grid
## the .y argument is the element of the vars vector in the current iter
## the first iter we have .y="Type" and .x=.
## inside the levelize function
## basically mutate .x and change the variable named .y (Type) to become a factor
## with the levels returned from the original df
    reduce(intersect(vars, colnames(.)), levelize, .init=.)
}

After calling the function now the levels are set correctly, we can check this by printing the unique values of each factor column:

estimate_prevalence_differences_polr(fit, Type, df) %>% select(where(is.factor)) %>% map(unique)
$Type
[1] Quebec      Mississippi
Levels: Quebec Mississippi

$Plant
[1] Qc2
Levels: Qn1 Qn2 Qn3 Qc1 Qc3 Qc2 Mn3 Mn2 Mn1 Mc2 Mc3 Mc1

$Treatment
[1] chilled    nonchilled
Levels: nonchilled chilled

EDIT :

a way easier solution is to use cur_column I didn't now about this option when I first wrote the answer. so basically instead of passing . to get_all_levels pass cur_column() i.e

expand_grid(
    rep = seq(1, 1000, 1), 
    modelr::data_grid(df, !!variable, .model = model)
  ) %>%
    mutate(across(where(is.character), ~factor(., levels = get_all_levels(cur_column(), df))))
Abdessabour Mtk
  • 3,895
  • 2
  • 14
  • 21
  • @MatthewLoop this does give the desired output, as I show in the checking section plan has only one value but has all the levels – Abdessabour Mtk Oct 24 '20 at 19:15
  • @MatthewLoop can you at least explain how this answer isn't valid – Abdessabour Mtk Oct 24 '20 at 19:54
  • I just retried the code and the exand.grid doesn't contain a plant column, also I'd suggest using the approach in the edit + and replace tha get_all_levelsfunction with the simple command `levels(df[[cur_colum()]])` – Abdessabour Mtk Oct 27 '20 at 21:11
  • I'm failing to understand the problem. it clearly states in your question that you want all the factor variables to have all the levels found in the original data, if this is not what you seek, then can you put what you want in simple words. – Abdessabour Mtk Oct 28 '20 at 01:14
  • I greatly modified the question, showed the updated output that gives the correct answer, and accepted your answer as correct. The edited version worked, as you said. – Matthew Loop Oct 29 '20 at 13:57
  • @MatthewLoop the problem you had is due to `df` not being defined in ` get_all_levels(!!sym(.y), df)` – Abdessabour Mtk Oct 29 '20 at 15:15
0

I think you are making it more complicated than it needs. If I remove all the tidyeval stuff, we get this which is much simpler:

mutate(across(where(is.character), ~ factor(., levels = levels(.))))

You can extract the ~ lambda into a reusable named function.

This implementation immediately brings issues to mind though. Since levels() returns NULL with character vectors, this only works with factors. And with factors this is a no-op. I'm not sure what you're trying to achieve.

Also, the get-all-levels implementation has a problem: it takes a selection, which can return multiple vectors, e.g. with starts_with(). In this case you will get a confusing error.

Lionel Henry
  • 6,652
  • 27
  • 33