11

I suspect that this question might be a duplicate; however, I found nothing satisfactory. Imagine a simple dataset with a structure like this:

set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a  = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

    cov_a cov_b      cont_a      cont_b          dep
1       0     1 0.238726027 0.784575267 0.9860542973
2       1     0 0.962358936 0.009429905 0.1370674714
3       0     0 0.601365726 0.779065883 0.9053095817
4       1     1 0.515029727 0.729390652 0.5763018376
5       1     0 0.402573342 0.630131853 0.3954488591
6       0     1 0.880246541 0.480910830 0.4498024841
7       1     1 0.364091865 0.156636851 0.7065019011
8       1     1 0.288239281 0.008215520 0.0825027458
9       1     0 0.170645235 0.452458394 0.3393125802
10      0     0 0.172171746 0.492293329 0.6807875512

What I'm looking for is an elegant dplyr/tidyverse option to fit a separate regression model for every cov_ variable, while including the same set of additional variables and the same dependent variable.

I'm able to solve this problem with the following code (it requires purrr, dplyr, tidyr and broom):

map(.x = names(df)[grepl("cov_", names(df))],
    ~ df %>%
      nest() %>%
      mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
      unnest(res))

[[1]]
# A tibble: 4 x 6
  data               term        estimate std.error statistic      p.value
  <list>             <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 <tibble [100 × 5]> (Intercept)   0.472     0.0812     5.81  0.0000000799
2 <tibble [100 × 5]> cont_a       -0.103     0.0983    -1.05  0.296       
3 <tibble [100 × 5]> cont_b        0.172     0.0990     1.74  0.0848      
4 <tibble [100 × 5]> cov_a        -0.0455    0.0581    -0.783 0.436       

[[2]]
# A tibble: 4 x 6
  data               term        estimate std.error statistic     p.value
  <list>             <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 <tibble [100 × 5]> (Intercept)   0.415     0.0787     5.27  0.000000846
2 <tibble [100 × 5]> cont_a       -0.0874    0.0984    -0.888 0.377      
3 <tibble [100 × 5]> cont_b        0.181     0.0980     1.84  0.0682     
4 <tibble [100 × 5]> cov_b         0.0482    0.0576     0.837 0.405 

However, I would like to avoid the double map() and solve it with a somewhat more direct or elegant approach.

tmfmnk

7 Answers

9

I'm not sure if this will be considered more direct/elegant, but here is my solution, which does not use a double map:

library(tidyverse)
library(broom)

# Build the model formula for a given covariate and fit it against df
gen_model_expr <- function(var) {
  form <- paste("dep ~ cont_a + cont_b +", var)
  tidy(lm(as.formula(form), data = df))
}

grep("cov_", names(df), value = TRUE) %>%
  map(gen_model_expr)

Output (Note that it does not retain the data column):

[[1]]
# A tibble: 4 x 5
  term        estimate std.error statistic      p.value
  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)   0.472     0.0812     5.81  0.0000000799
2 cont_a       -0.103     0.0983    -1.05  0.296       
3 cont_b        0.172     0.0990     1.74  0.0848      
4 cov_a        -0.0455    0.0581    -0.783 0.436       

[[2]]
# A tibble: 4 x 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   0.415     0.0787     5.27  0.000000846
2 cont_a       -0.0874    0.0984    -0.888 0.377      
3 cont_b        0.181     0.0980     1.84  0.0682     
4 cov_b         0.0482    0.0576     0.837 0.405 
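
If you want each list element labelled with the covariate it belongs to, a small extension (not part of the solution above, just purrr) is to name the input vector with set_names(), since map() keeps names:

# label each result with its covariate (set_names() uses the values as names)
grep("cov_", names(df), value = TRUE) %>%
  set_names() %>%
  map(gen_model_expr)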

EDIT

In the interest of speed (inspired by @TimTeaFan), here is a benchmark comparing the different ways to retrieve the covariate names. grep("cov_", names(df), value = TRUE) is the fastest.
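
The table below is bench::mark()-style output; a minimal sketch of how such a comparison could be run (assuming the bench package; the exact call is not part of this answer):

# all four expressions return the same character vector, so the default
# result check of bench::mark() passes
bench::mark(
  names(df)[grepl("cov_", names(df))],
  grep("cov_", colnames(df), value = TRUE),
  grep("cov_", names(df), value = TRUE),
  df %>% select(starts_with("cov_")) %>% colnames()
)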

# A tibble: 4 x 13
  expression                                         min median `itr/sec` mem_alloc
  <bch:expr>                                      <bch:> <bch:>     <dbl> <bch:byt>
1 names(df)[grepl("cov_", names(df))]             7.59µs  8.4µs   101975.        0B
2 grep("cov_", colnames(df), value = TRUE)        8.21µs 8.96µs   103142.        0B
3 grep("cov_", names(df), value = TRUE)           6.96µs 7.43µs   128694.        0B
4 df %>% select(starts_with("cov_")) %>% colnames 1.17ms 1.33ms      636.    5.39KB
latlio
  • A note on the names selection benchmarks: adding `fixed = TRUE` avoids using regular expressions (not needed here) and is about 3 to 4 times faster. – Laurent Bergé Jan 14 '21 at 09:11
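
For reference, the fixed-string variant suggested in that comment is simply:

grep("cov_", names(df), value = TRUE, fixed = TRUE)  # no regex needed here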
7

Not directly tidyverse-based, but multiple estimations can be performed at once with the fixest package.

The syntax is explicit and the solution to the question asked can be written in one line of code:

library(fixest)

set.seed(123)
n = 100
df = data.frame(cov_a = rbinom(n, 1, prob = 0.5),
                cov_b = rbinom(n, 1, prob = 0.5),
                cont_a  = runif(n), cont_b = runif(n),
                dep = runif(n))

# Estimation: sw means stepwise
res = feols(dep ~ sw(cov_a, cov_b) + cont_a + cont_b, df)

That's it: the two estimations are done. (Note that feols is like lm.)

Then you can display the results:

# Display the results
etable(res, order = "Int|cov")
#>                            model 1            model 2
#> Dependent Var.:                dep                dep
#>                                                      
#> (Intercept)     0.4722*** (0.0812) 0.4148*** (0.0788)
#> cov_a             -0.0455 (0.0581)                   
#> cov_b                                 0.0482 (0.0576)
#> cont_a            -0.1033 (0.0983)   -0.0874 (0.0984)
#> cont_b            0.1723. (0.0990)   0.1808. (0.0980)
#> _______________ __________________ __________________
#> S.E. type                 Standard           Standard
#> Observations                   100                100
#> R2                         0.04823            0.04910
#> Adj. R2                    0.01849            0.01938

And easily get them into the tidyverse:

# Get the results into tidyverse
library(broom)
lapply(as.list(res), function(x) tidy(x))
#> [[1]]
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)   0.472     0.0812     5.81  0.0000000799
#> 2 cov_a        -0.0455    0.0581    -0.783 0.436       
#> 3 cont_a       -0.103     0.0983    -1.05  0.296       
#> 4 cont_b        0.172     0.0990     1.74  0.0848      
#> 
#> [[2]]
#> # A tibble: 4 x 5
#>   term        estimate std.error statistic     p.value
#>   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
#> 1 (Intercept)   0.415     0.0787     5.27  0.000000846
#> 2 cov_b         0.0482    0.0576     0.837 0.405      
#> 3 cont_a       -0.0874    0.0984    -0.888 0.377      
#> 4 cont_b        0.181     0.0980     1.84  0.0682

Here's the documentation on multiple estimations. You can also, still in a single call, apply multiple estimations across dependent variables, fixed-effects (if appropriate) and split samples, all very compactly.
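
As a small illustrative sketch (not taken from the linked documentation): combining two dependent variables with the stepwise covariates runs 2 x 2 = 4 estimations in a single call:

# illustration only: cont_b is treated as a second dependent variable here
res_all = feols(c(dep, cont_b) ~ sw(cov_a, cov_b) + cont_a, df)
etable(res_all)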

One last thing: multiple estimations are highly optimized, so you'll get substantial performance gains vis-à-vis loop-based solutions (like map).

Laurent Bergé
5

I throw in two approaches:

The first one is rather boring. We can use {dplyr}'s rowwise notation instead of purrr::map. This approach comes in two flavors: after rowwise we can use either (A) mutate %>% unnest or (B) group_map. In both approaches I refrained from duplicating the data, but we can easily include the data in each row if needed (when setting up the tibble we can do tibble(myvar = ., data = list(df))). While (A) gives us one tibble that includes all the results, the group_map in (B) returns a list similar to the "double map" approach from the original example.

The second approach I'd consider rather "fresh" (although less straightforward), since it uses neither rowwise nor map. Here we use {dplyr}'s across() in combination with cur_column(), create a new column for each output, and then pivot_longer and unnest to get all the results into one tibble.

The final benchmark shows: "double_map" is slowest (because it duplicates the data column), followed by "across" and "row_unnest", while "row_groupmap" is rather fast. Nevertheless, the fastest approach is the one by @latlio using map and a custom function (below called "map_custom_fun"); although it uses purrr, it might be less "dplyr-ish".

library(tidyverse)
library(broom)

set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a  = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

# first approach: using dplyr rowwise

# Variation A:
# rowwise %>% mutate %>% unnest
df %>% 
  select(starts_with("cov_")) %>% 
  colnames %>%
  tibble(myvar = .) %>%
  rowwise() %>% 
  mutate(res = list(tidy(lm(dep ~ cont_a + cont_b + eval(sym(.data$myvar)), data = df)))) %>% 
  unnest(res)
#> # A tibble: 8 x 6
#>   myvar term                   estimate std.error statistic      p.value
#>   <chr> <chr>                     <dbl>     <dbl>     <dbl>        <dbl>
#> 1 cov_a (Intercept)              0.472     0.0812     5.81  0.0000000799
#> 2 cov_a cont_a                  -0.103     0.0983    -1.05  0.296       
#> 3 cov_a cont_b                   0.172     0.0990     1.74  0.0848      
#> 4 cov_a eval(sym(.data$myvar))  -0.0455    0.0581    -0.783 0.436       
#> 5 cov_b (Intercept)              0.415     0.0787     5.27  0.000000846 
#> 6 cov_b cont_a                  -0.0874    0.0984    -0.888 0.377       
#> 7 cov_b cont_b                   0.181     0.0980     1.84  0.0682      
#> 8 cov_b eval(sym(.data$myvar))   0.0482    0.0576     0.837 0.405

# Variation B:
# rowwise %>% group_map
df %>% 
  select(starts_with("cov_")) %>% 
  colnames %>%
  tibble(myvar = .) %>%
  rowwise() %>% 
  group_map(.keep = TRUE,
            ~ tidy(lm(dep ~ cont_a + cont_b + eval(sym(.x$myvar)), data = df)))
#> [[1]]
#> # A tibble: 4 x 5
#>   term                estimate std.error statistic      p.value
#>   <chr>                  <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)           0.472     0.0812     5.81  0.0000000799
#> 2 cont_a               -0.103     0.0983    -1.05  0.296       
#> 3 cont_b                0.172     0.0990     1.74  0.0848      
#> 4 eval(sym(.x$myvar))  -0.0455    0.0581    -0.783 0.436       
#> 
#> [[2]]
#> # A tibble: 4 x 5
#>   term                estimate std.error statistic     p.value
#>   <chr>                  <dbl>     <dbl>     <dbl>       <dbl>
#> 1 (Intercept)           0.415     0.0787     5.27  0.000000846
#> 2 cont_a               -0.0874    0.0984    -0.888 0.377      
#> 3 cont_b                0.181     0.0980     1.84  0.0682     
#> 4 eval(sym(.x$myvar))   0.0482    0.0576     0.837 0.405


# second approach: using summarise(across)
# we need a `tibble` here, otherwise printing gets messed up
df_tbl <- as_tibble(df)

df_tbl %>% 
  summarise(across(starts_with("cov_"), 
                   ~ list(tidy(lm(
                     reformulate(c("cont_a","cont_b", cur_column()), "dep"),
                     data = df_tbl))))) %>% 
  pivot_longer(cols = everything(),
               names_to = "var",
               values_to = "res") %>% 
  unnest(res)
#> # A tibble: 8 x 6
#>   var   term        estimate std.error statistic      p.value
#>   <chr> <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 cov_a (Intercept)   0.472     0.0812     5.81  0.0000000799
#> 2 cov_a cont_a       -0.103     0.0983    -1.05  0.296       
#> 3 cov_a cont_b        0.172     0.0990     1.74  0.0848      
#> 4 cov_a cov_a        -0.0455    0.0581    -0.783 0.436       
#> 5 cov_b (Intercept)   0.415     0.0787     5.27  0.000000846 
#> 6 cov_b cont_a       -0.0874    0.0984    -0.888 0.377       
#> 7 cov_b cont_b        0.181     0.0980     1.84  0.0682      
#> 8 cov_b cov_b         0.0482    0.0576     0.837 0.405

Created on 2021-01-10 by the reprex package (v0.3.0)

Benchmark

#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq      max neval
#>      double_map 20.92847 22.238626 23.645280 22.726705 25.002493 34.29351   100
#>      row_unnest 15.30179 15.835506 16.714873 16.358134 17.314802 20.60496   100
#>    row_groupmap 10.10168 10.490374 11.237398 10.709524 11.452677 20.40186   100
#>          across 16.47369 17.117178 18.593908 17.945136 19.431190 29.29384   100
#>  map_custom_fun  6.85758  7.311608  7.953066  7.611394  8.305757 19.57006   100
TimTeaFan
3

Here is another version:

library(broom)

purrr::map_df(grep("cov_", names(df), value = TRUE), 
   ~tidy(lm(reformulate(c('cont_a', 'cont_b', .x), 'dep'), data = df)))

#  term        estimate std.error statistic      p.value
#  <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#1 (Intercept)   0.472     0.0812     5.81  0.0000000799
#2 cont_a       -0.103     0.0983    -1.05  0.296       
#3 cont_b        0.172     0.0990     1.74  0.0848      
#4 cov_a        -0.0455    0.0581    -0.783 0.436       
#5 (Intercept)   0.415     0.0787     5.27  0.000000846 
#6 cont_a       -0.0874    0.0984    -0.888 0.377       
#7 cont_b        0.181     0.0980     1.84  0.0682      
#8 cov_b         0.0482    0.0576     0.837 0.405       

If you want a list output back, use map instead of map_df.
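
For example:

purrr::map(grep("cov_", names(df), value = TRUE), 
   ~tidy(lm(reformulate(c('cont_a', 'cont_b', .x), 'dep'), data = df)))

This returns a two-element list with one tidy tibble per cov_ variable.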

Ronak Shah
3

Just to add one more. If all of your other covariates are the same and you only want to swap out the cov_ variables, you can simplify things by selecting the relevant variables first and then using the dep ~ . notation in lm(). As you can probably guess, it means "include all other variables in the dataset that are not the dependent variable".

purrr::map(grep("cov_", colnames(df), value = TRUE), 
           function(x){
             df2 <- select(df, x, cont_a, cont_b, dep)
             tidy(lm(dep ~ ., df2))
           })
Mike
2

Has anyone tried something with data.table?

library("data.table")
covariables <- c("cov_a", "cov_b")
# or, according to what you want to do :  
covariables <- names(df)[grepl(names(df), pattern = "cov_")]
formulas <- paste0("dep ~ cont_a + cont_b + ",covariables )
res <- df[, lapply(formulas, function(x) list(lm(x, data=.SD)))]
summary.lm(res$V1[[1]])
summary.lm(res$V2[[1]])

Here is an attempt at a benchmark, with other pieces picked up here and there among the previous answers for comparison (note that broom and the tidyverse are loaded just for those comparisons):


library(tidyverse)
library(broom)
library("data.table")

library(microbenchmark)

test <- microbenchmark(dt = {
covariables <- c("cov_a", "cov_b") 
covariables <- names(df)[grepl(names(df), pattern = "cov_")]
formulas <- paste0("dep ~ cont_a + cont_b + ",covariables )
res <- df[, lapply(formulas, function(x) list(lm(x, data=.SD)))]
summary.lm(res$V1[[1]])
summary.lm(res$V2[[1]])
},
broom = {
df_tbl <- as_tibble(df)

df_tbl %>% 
  summarise(across(starts_with("cov_"), 
                   ~ list(tidy(lm(
                     reformulate(c("cont_a","cont_b", cur_column()), "dep"),
                     data = df_tbl))))) %>% 
  pivot_longer(cols = everything(),
               names_to = "var",
               values_to = "res") %>% 
  unnest(res)
}, gep_cov = {
  gen_model_expr <- function(var) {
    form = paste("dep ~ cont_a + cont_b +", var)
    tidy(lm(as.formula(form), data = df))
  }
  
  grep("cov_", names(df), value = TRUE) %>%
    map(gen_model_expr)
}, times = 100)

And the results:

> test
Unit: milliseconds
    expr     min       lq      mean  median       uq     max neval cld
      dt  2.3403  2.63330  3.116303  2.8403  3.21620 10.5219   100 a  
   broom 18.5069 20.40585 22.711605 21.3388 24.04675 46.8398   100   c
 gep_cov  7.7221  8.39180 10.086074  9.4315 10.54345 21.0377   100  b 
2

You can do this with summarise and across, no map needed. This returns model output in list-columns.

tibble(df) %>% 
  summarise(across(starts_with("cov_"), ~list(tidy(lm(dep ~ cont_a + cont_b + .x))))) 

# A tibble: 1 x 2
  cov_a            cov_b           
  <list>           <list>          
1 <tibble [4 × 5]> <tibble [4 × 5]>

Then just unnest to view model output:

  ... %>% unnest(cov_a)
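
Spelled out in full, that is:

tibble(df) %>% 
  summarise(across(starts_with("cov_"), ~list(tidy(lm(dep ~ cont_a + cont_b + .x))))) %>% 
  unnest(cov_a)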

I don't love how the term for the covariate gets listed literally as '.x' in the output. You can fix this by referencing cur_column() in the lm formula, although it's not quite as clean:

tibble(df) %>% 
  summarise(
    across(starts_with("cov_"), 
           ~list(tidy(lm(reformulate(c("cont_a", "cont_b", cur_column()), "dep"))))))
andrew_reece