applying a glm() to several columns individually in dplyr 1.0.0

Question

I would like to apply a similar glm() to several columns, and I am wondering whether there is a neat way to do this with the new dplyr functionality.

# data
set.seed(1234)
df <- data.frame(out1 = rbinom(100, 1, prob = 0.5),
                 out2 = c(rbinom(50, 1, prob = 0.2),
                          rbinom(50, 1, prob = 0.8)),
                 pred = factor(rep(letters[1:2], each = 50)))

Following the method laid out in this post, I can use purrr::map:

library(dplyr)
library(purrr)

df %>%
  select_if(is.numeric) %>%
  map(~ glm(. ~ df$pred,
            family = binomial))

# output
# $out1
# 
# Call:  glm(formula = . ~ df$pred, family = binomial)
# 
# Coefficients:
#   (Intercept)     df$predb  
# 3.589e-16   -4.055e-01  
# 
# Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
# Null Deviance:        137.6 
# Residual Deviance: 136.6  AIC: 140.6
# 
# $out2
# 
# Call:  glm(formula = . ~ df$pred, family = binomial)
# 
# Coefficients:
#   (Intercept)     df$predb  
# -1.153        2.305  
# 
# Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
# Null Deviance:        138.6 
# Residual Deviance: 110.2  AIC: 114.2

This returns a list and works just fine. But I was wondering whether it is possible to use the new dplyr 1.0.0 functionality to get a similar (or even neater) result: the sort of row-by-row data frame output returned by broom::glance or broom::tidy. Something along the lines of this blog post, but transposed to this version of the problem and, at a guess, using across()?

Also, it would be nice if I could use starts_with("out") to select the columns that glm() is applied to.

score 2 · Accepted Answer · answered Nov 06 '21 at 07:34

Perhaps it would be easier if you first reshape the data into long format.

library(tidyverse)
library(broom)

df %>%
  pivot_longer(cols = starts_with('out')) %>%
  group_by(name) %>%
  summarise(model = list(glm(value~pred, family = binomial))) %>%
  mutate(data = map(model, tidy)) %>%
  unnest(data)

#  name  model  term         estimate std.error statistic     p.value
#  <chr> <list> <chr>           <dbl>     <dbl>     <dbl>       <dbl>
#1 out1  <glm>  (Intercept)  3.59e-16     0.283  1.27e-15 1.00       
#2 out1  <glm>  predb       -4.05e- 1     0.404 -1.00e+ 0 0.316      
#3 out2  <glm>  (Intercept) -1.15e+ 0     0.331 -3.48e+ 0 0.000500   
#4 out2  <glm>  predb        2.31e+ 0     0.468  4.92e+ 0 0.000000853
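
Since the question also mentions broom::glance, here is a minimal sketch of the same long-format approach that returns one row of model-level statistics per outcome instead of one row per coefficient. It assumes the df from the question; the names fits, stats and model_stats are only illustrative, and the model list column is dropped before unnesting to keep the result flat.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Same example data as in the question
set.seed(1234)
df <- data.frame(out1 = rbinom(100, 1, prob = 0.5),
                 out2 = c(rbinom(50, 1, prob = 0.2),
                          rbinom(50, 1, prob = 0.8)),
                 pred = factor(rep(letters[1:2], each = 50)))

# Fit one model per outcome column and keep the fits in a list column
fits <- df %>%
  pivot_longer(cols = starts_with("out")) %>%
  group_by(name) %>%
  summarise(model = list(glm(value ~ pred, family = binomial)))

# glance() returns a one-row data frame per model (AIC, deviance, ...)
model_stats <- fits %>%
  mutate(stats = map(model, glance)) %>%
  select(-model) %>%
  unnest(stats)

model_stats
```

Keeping fits as a separate object means the same fitted models can be passed to tidy() for coefficients or glance() for fit statistics without refitting.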

Thanks again @Ronak Shah. What does the `data = foo` argument do in the `mutate` and `unnest` functions? — llewmills, Nov 06 '21 at 21:24
`data` stores the output returned by the `tidy` function. `tidy` returns a data frame for each model, so we `unnest` it to spread its contents into separate columns. — Ronak Shah, Nov 07 '21 at 02:51
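
To illustrate the point about list columns, here is a small toy sketch that does not use the models above. The column name data is arbitrary (any name passed to mutate would work), and the values in nested are made-up placeholders, not model output.

```r
library(tibble)
library(tidyr)

# A tibble whose `data` column is a list of small data frames,
# one per row (placeholder values, for illustration only)
nested <- tibble(
  name = c("out1", "out2"),
  data = list(
    tibble(term = c("a", "b"), estimate = c(1, 2)),
    tibble(term = c("a", "b"), estimate = c(3, 4))
  )
)

# unnest() expands each nested data frame into ordinary rows and columns,
# repeating `name` for every row that came from the same element
flat <- unnest(nested, data)

flat
```

Here nested has two rows and a list column, while flat has four rows with the ordinary columns name, term and estimate.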
