Extract models based on conditions in r

Question

I ran about 30,000 multiple regression models using pylr packages.

fit = plyr::dlply(data, "id", function(x) lm(t ~ d1 + d2 + d3, data = x))

We have more than 30,000 IDs so it came out with more than 30,000 models. I cannot read the models one by one. Are there any efficient ways to look at the models based on the conditions. I want to look at models according to

top 5 p-value (from the smallest)
top r squred
all the models with d3 significant

At this moment, I can only look at the model one by one using l_ply(fit, summary, .print = TRUE)

I tried to use tidy function in broom packages in order to change the results into cleaned format so that i can export .csv file but it wouldn't work out.

test = broom::tidy(fit)

I received error messages telling me Error in tidy.list(fit) : No tidying method recognized for this list

Any advice would be appreciated!

Here are some data added:

df <- data.frame(id=sample(LETTERS[1:10], replace= T, 10),
             d1=rnorm(20), d2=rnorm(20), d3=rnorm(20), t= rnorm(20), stringsAsFactors = F)

fit1 = plyr::dlply(df, "id", function(x) lm(t ~ d1 + d2 + d3, data = x))

Take a look at the "Many Models" [article](http://r4ds.had.co.nz/many-models.html) that makes use of `broom` functions — camille, Jun 18 '18 at 19:51
@camille thank you very much! I followed the article and got exactly what I want! as a follow-up question, are there any way we can do the same thing on the r-squared and p.value for each independent variable? — NightDog, Jun 20 '18 at 14:18

dww · Answer 1 · 2018-06-18T19:58:11.263

2

Here's an example of extracting the 3 highest R^2 models.

r2 = unlist(lapply(fit, function(x) summary(x)$r.squared))
fit[order(r2, decreasing = T)[1:3]]

The others are analagous: See this post https://stackoverflow.com/a/5587781/2761575 for how to extract p-values if you want to rank by these instead

edited Jun 18 '18 at 19:58

answered Jun 18 '18 at 19:50

dww

30,425
5
68
111

score 2 · Accepted Answer · answered Jun 20 '18 at 15:12

Here's a workflow based on the Many Models article, using broom functions on nested data frames. Note that I increased the size of the data frame, just to have more observations to work with.

With this method, you group data by ID, nest it, and create the lm objects by mapping over the nested data. broom::glance gives you summary output on the models:

library(tidyverse)
library(broom)

set.seed(124)
df <- data.frame(id=sample(LETTERS[1:10], size = 10, replace = T),
                 d1=rnorm(100), d2=rnorm(100), d3=rnorm(100), t= rnorm(100), stringsAsFactors = F)

models_df <- df %>%
  group_by(id) %>%
  nest() %>%
  mutate(mod = map(data, function(x) lm(t ~ d1 + d2 + d3, data = x))) %>%
  mutate(glance = map(mod, glance)) %>%
  unnest(glance)

models_df
#> # A tibble: 6 x 14
#>   id    data   mod   r.squared adj.r.squared sigma statistic p.value    df
#>   <chr> <list> <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#> 1 A     <tibb… <S3:…    0.495         0.243  0.955     1.96   0.221      4
#> 2 E     <tibb… <S3:…    0.142        -0.0189 0.907     0.883  0.471      4
#> 3 F     <tibb… <S3:…    0.179         0.0247 1.05      1.16   0.355      4
#> 4 D     <tibb… <S3:…    0.818         0.727  0.522     8.97   0.0123     4
#> 5 C     <tibb… <S3:…    0.0514       -0.0580 0.994     0.470  0.706      4
#> 6 J     <tibb… <S3:…    0.202        -0.197  0.933     0.506  0.693      4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

Now that you have a data frame with summary statistics like R-squared, you can use regular dplyr commands like top_n to get rankings:

# smallest 5 p-value
models_df %>% top_n(-5, p.value)
#> # A tibble: 5 x 14
#>   id    data   mod   r.squared adj.r.squared sigma statistic p.value    df
#>   <chr> <list> <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#> 1 A     <tibb… <S3:…     0.495        0.243  0.955     1.96   0.221      4
#> 2 E     <tibb… <S3:…     0.142       -0.0189 0.907     0.883  0.471      4
#> 3 F     <tibb… <S3:…     0.179        0.0247 1.05      1.16   0.355      4
#> 4 D     <tibb… <S3:…     0.818        0.727  0.522     8.97   0.0123     4
#> 5 J     <tibb… <S3:…     0.202       -0.197  0.933     0.506  0.693      4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

# top r-sq
models_df %>% top_n(1, r.squared)
#> # A tibble: 1 x 14
#>   id    data   mod   r.squared adj.r.squared sigma statistic p.value    df
#>   <chr> <list> <lis>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#> 1 D     <tibb… <S3:…     0.818         0.727 0.522      8.97  0.0123     4
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

For significant d3 terms, I went back to the models data frame and mapped broom::tidy across the models, which gives summary stats on each term, then filtered to study the d3 terms.

# significant d3--assuming alpha = 0.05
models_df %>%
  mutate(tidied = map(mod, tidy)) %>%
  unnest(tidied, .drop = T) %>%
  filter(term == "d3") %>%
  filter(p.value < 0.05)
#> # A tibble: 1 x 17
#>   id    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl>
#> 1 D         0.818         0.727 0.522      8.97  0.0123     4  -5.13  20.3
#> # ... with 8 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#> #   term <chr>, estimate <dbl>, std.error <dbl>, statistic1 <dbl>,
#> #   p.value1 <dbl>

Created on 2018-06-20 by the reprex package (v0.2.0).

@Y.Yuan I was happy to practice this approach, since I'm still getting the hang of the nested data frames and the move to `map` to replace `do` — camille, Jun 20 '18 at 15:28
@camille when i tried to nest data using models_df <- df %>% group_by(id) %>% nest() %>% mutate(mod = map(data, function(x) lm(t ~ d1 + d2 + d3, data = x))) %>% mutate(glance = map(mod, glance)) %>% unnest(glance) I've got an error message - cannot convert a list to a function... — NightDog, Jun 20 '18 at 16:03
Not sure, try naming it something other than glance...that wasn't the safest idea on my part to name a variable the same as a function — camille, Jun 20 '18 at 16:07

Extract models based on conditions in r

2 Answers2