Efficiently and iteratively fit, diagnose, modify, and organize linear models in R (into one place)

Question

oftentimes I need to fit several linear models (dozens) in R. These models may have the same predictor but varying response variables, data distribution families, and more.

What I've been doing: fitting models one by one, looking at diagnostic plots one by one, and altering the model as needed (see below).

My goal: fit models with functions that fit initial models using a specified set of predictors and response variables and families and datasets. Put those models into one place, like a dataframe, where I can access the model object itself as well as characteristics like the model's AIC or formula. Provide diagnostic plots with DHARMa. If necessary, modify the model and then update the dataframe to reflect that model's changes.

How can I achieve this ideal workflow? I've tried to make a reprex, below. Thanks for any advice!


library(tidyverse)
library(DHARMa)
library(magrittr)
library(glmmTMB)

cars <- datasets::mtcars

# 3 models, each with different response variables. Some (e.g. glm2)
# have non-Gaussian families; some have filtered datasets
lm1 <- glmmTMB(mpg ~ hp, data = cars)
glm2 <- glmmTMB(cyl ~ hp, 
                data = cars,
               family = "poisson")
lm3 <- glmmTMB(drat ~ hp,
               data = cars %>%
                 dplyr::filter(carb != 1))

# next, look at diagnostic plots
plot(DHARMa::simulateResiduals(lm1))
plot(DHARMa::simulateResiduals(glm2))
plot(DHARMa::simulateResiduals(lm3))

# transform one model's response variable to satisfy assumptions
glm2 <- glmmTMB(cyl^2 ~ hp, 
                data = cars,
                  family = "poisson")

# check diagnostics again
plot(DHARMa::simulateResiduals(lm2))

# organize models into one place. A dataframe? A list? Somewhere where
# I can access the model itself and the model formula so I can extract
# those things to be used as arguments in other functions
?

I_O · Accepted Answer · 2023-06-07T21:43:49.383

You can put "everything" into a dataframe, as long as the receiving column is a list-column (meaning cell contents of this columns are lists). That's a convenient way to slice & dice your data and keep them and/or model formulae, recipes, plots, you name it ... in a tabular structure. Just remember to encapsulate items in a list, where necessary.

In case of your example, you could:

create a dataframe of all desired combinations of model function, dependent, predictors etc.:

the_grid <- 
  expand.grid(dependent = c('mpg', 'cyl', 'drat'),
              type = 'glmmTMB',
              predictors = c('hp'),
              predicate = c('TRUE', 'TRUE', 'carb != 1'),
              stringsAsFactors = FALSE
              )

use this dataframe to add list-columns containing, e.g. a model and a plot per parameter combination:

library(DHARMa)
library(glmmTMB)
library(dplyr)
library(ggeffects) ## to plot model effects

the_models <- 
  the_grid |>
  rowwise() |> ## !important
  mutate(
    m = do.call(type,
                args = list(formula = reformulate(response = dependent,
                                                  termlabels = predictors,
                                                  ),
                            data = mtcars
                            )
                ) |> list(), ## don't forget to wrap the result in a list
    ## add a ggeffect plot object for fun:
    p = list(ggpredict(m) |> plot()) ## again, use a list
  )

## > the_models
## # A tibble: 9 x 6
## # Rowwise: 
##   dependent type    predictors predicate m         p               
##   <fct>     <fct>   <fct>      <fct>     <list>    <list>          
## 1 mpg       glmmTMB hp         TRUE      <glmmTMB> <named list [1]>
## 2 cyl       glmmTMB hp         TRUE      <glmmTMB> <named list [1]>
## 3 drat      glmmTMB hp         TRUE      <glmmTMB> <named list [1]>
## 4 mpg       glmmTMB hp         TRUE      <glmmTMB> <named list [1]>
## 5 cyl       glmmTMB hp         TRUE      <glmmTMB> <named list [1]>
## 6 drat      glmmTMB hp         TRUE      <glmmTMB> <named list [1]>
## 7 mpg       glmmTMB hp         carb != 1 <glmmTMB> <named list [1]>
## 8 cyl       glmmTMB hp         carb != 1 <glmmTMB> <named list [1]>

example: model summary for row 1:

## > the_models$m[[1]] |> summary()
##  Family: gaussian  ( identity )
## Formula:          mpg ~ hp
## Data: filter(mtcars, TRUE)
## 
##      AIC      BIC   logLik deviance df.resid 
##    181.2    185.6    -87.6    175.2       29 
## 
## 
## Dispersion estimate for gaussian family (sigma^2):   14 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 30.098862   1.582037  19.025  < 2e-16 ***
## hp          -0.068228   0.009798  -6.964 3.32e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

example: effect plot for row 5:

the_models$p[[5]]

This is otherwise great, but please `reformulate()` rather than `parse()`/`eval()`ing ... — Ben Bolker, Jun 07 '23 at 20:46
Ben Bolker, thank you for pointing that out. I (hopefully) incorporated `reformulate` as should be, but am at a loss about how to set the filter predicate other than parse()/eval(). Please feel free to edit or repost in a clean version, I'm afraid I'm missing something essential. — I_O, Jun 07 '23 at 21:25
The answer to that is probably to use some of the new `rlang` stuff (!!!/ensym/etc.), but I'm not very good at that stuff so I'll let someone else chime in ... e.g. https://stackoverflow.com/questions/69056666/alternatives-to-eval-parse-with-dplyr — Ben Bolker, Jun 07 '23 at 22:56
Thanks so much for this! Two questions: What's the purpose of using expand.grid() as opposed to data.frame()? And could you please add some comments explaining what's going on in the mutate chunk? I'm not sure what's happening in these lines. — stb, Jun 08 '23 at 15:06
You're welcome! `expand.grid` indeed generates a dataframe (check with `class(expand.grid()). `do.call(f, args = list(x, y))` calls (=executes) `f(x, y)`. I use it to be flexible which function to call (here: the string stored in column *type* which is "glmmTMB" but could be "lm" or "gam" or any other modelling function). In the `args` argument I specify which formula to use on which data. Initially I also added a filter ("predicate") which worked but as Ben Bolker pointed out was ugly coding (haven't wrapped my head around the proper code yet). — I_O, Jun 08 '23 at 16:50
Anyhow, if you expect a lot of modelling and tuning it might pay to invest in {caret} and/or {tidymodels} right from the start (as opposed to above solution for the occasional modeller like me): https://towardsdatascience.com/caret-vs-tidymodels-create-complete-reusable-machine-learning-workflows-5c50a7befd2d , https://www.tidymodels.org/ , https://topepo.github.io/caret/index.html — I_O, Jun 08 '23 at 16:57

Efficiently and iteratively fit, diagnose, modify, and organize linear models in R (into one place)

1 Answers1