I suspect this question might be a duplicate, but I found nothing satisfactory. Imagine a simple dataset with a structure like this:
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
cov_b = rbinom(100, 1, prob = 0.5),
cont_a = runif(100),
cont_b = runif(100),
dep = runif(100))
   cov_a cov_b      cont_a      cont_b          dep
1      0     1 0.238726027 0.784575267 0.9860542973
2      1     0 0.962358936 0.009429905 0.1370674714
3      0     0 0.601365726 0.779065883 0.9053095817
4      1     1 0.515029727 0.729390652 0.5763018376
5      1     0 0.402573342 0.630131853 0.3954488591
6      0     1 0.880246541 0.480910830 0.4498024841
7      1     1 0.364091865 0.156636851 0.7065019011
8      1     1 0.288239281 0.008215520 0.0825027458
9      1     0 0.170645235 0.452458394 0.3393125802
10     0     0 0.172171746 0.492293329 0.6807875512
What I'm looking for is an elegant dplyr/tidyverse option to fit a separate regression model for every cov_ variable, while including the same set of additional variables and the same dependent variable.
I'm able to solve this problem using this code (requires purrr, dplyr, tidyr and broom):
map(.x = names(df)[grepl("cov_", names(df))],
    ~ df %>%
      nest(data = everything()) %>%
      mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
      unnest(res))
[[1]]
# A tibble: 4 x 6
data term estimate std.error statistic p.value
<list> <chr> <dbl> <dbl> <dbl> <dbl>
1 <tibble [100 × 5]> (Intercept) 0.472 0.0812 5.81 0.0000000799
2 <tibble [100 × 5]> cont_a -0.103 0.0983 -1.05 0.296
3 <tibble [100 × 5]> cont_b 0.172 0.0990 1.74 0.0848
4 <tibble [100 × 5]> cov_a -0.0455 0.0581 -0.783 0.436
[[2]]
# A tibble: 4 x 6
data term estimate std.error statistic p.value
<list> <chr> <dbl> <dbl> <dbl> <dbl>
1 <tibble [100 × 5]> (Intercept) 0.415 0.0787 5.27 0.000000846
2 <tibble [100 × 5]> cont_a -0.0874 0.0984 -0.888 0.377
3 <tibble [100 × 5]> cont_b 0.181 0.0980 1.84 0.0682
4 <tibble [100 × 5]> cov_b 0.0482 0.0576 0.837 0.405
However, I would like to avoid the double map() and solve it with a more direct or elegant approach.
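To make the goal concrete, a single-map() version might look something like the sketch below. It assumes that building each formula with stats::reformulate() and stacking the tidied results with dplyr::bind_rows() is acceptable. The names cov_vars and res are only illustrative, and this is one possible direction rather than a definitive solution.

```r
library(dplyr)
library(purrr)
library(broom)

# Same example data as above
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

# Names of the covariates that each get their own model (illustrative name)
cov_vars <- grep("^cov_", names(df), value = TRUE)

# One pass over the covariate names: build the formula as
# dep ~ cont_a + cont_b + <covariate>, fit the model, tidy it,
# then stack the results with a column identifying the covariate
res <- cov_vars %>%
  set_names() %>%
  map(~ tidy(lm(reformulate(c("cont_a", "cont_b", .x), response = "dep"),
                data = df))) %>%
  bind_rows(.id = "covariate")
```

Here res is a single tibble with four rows per model (intercept, cont_a, cont_b and the covariate itself), so the nest()/unnest() step and the inner map() are no longer needed.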