How can I create a column of formulas (such as y ~ x or y ~ log(x) or ...) from a nested dataframe of models?

In the attempt below, the model column holds, for each country, the candidate model with the largest R squared. The formula column is meant to show which of the candidate models was chosen in each row.

library(tidyverse)
library(broom)

df <- gapminder::gapminder %>% 
  select(country, x = year, y = lifeExp) %>%
  group_by(country) %>%
  nest()

rsq_f <- function(model){summary(model)$r.squared}

best_model <- function(df){
  models <- list(
    lm(formula = y ~ x, data = df),
    lm(formula = y ~ log(x), data = df),
    lm(formula = log(y) ~ x, data = df),
    lm(formula = log(y) ~ log(x), data = df)
  )

  R_squared <- map_dbl(models, rsq_f)
  best_model_num <- which.max(R_squared)

  models[[best_model_num]]
}

models <- df %>%
  mutate(
    model = map(data, best_model),
    rsq = map(model, broom::glance) %>% map_dbl("r.squared"),
    fun_call = map(model, formula)
  )

The output is

> models
# A tibble: 142 x 5
   country     data              model      rsq fun_call     
   <fct>       <list>            <list>   <dbl> <list>       
 1 Afghanistan <tibble [12 x 2]> <S3: lm> 0.949 <S3: formula>
 2 Albania     <tibble [12 x 2]> <S3: lm> 0.912 <S3: formula>
 3 Algeria     <tibble [12 x 2]> <S3: lm> 0.986 <S3: formula>
 4 Angola      <tibble [12 x 2]> <S3: lm> 0.890 <S3: formula>
 5 Argentina   <tibble [12 x 2]> <S3: lm> 0.996 <S3: formula>
 6 Australia   <tibble [12 x 2]> <S3: lm> 0.983 <S3: formula>
 7 Austria     <tibble [12 x 2]> <S3: lm> 0.994 <S3: formula>
 8 Bahrain     <tibble [12 x 2]> <S3: lm> 0.968 <S3: formula>
 9 Bangladesh  <tibble [12 x 2]> <S3: lm> 0.997 <S3: formula>
10 Belgium     <tibble [12 x 2]> <S3: lm> 0.995 <S3: formula>
# ... with 132 more rows

Instead of <S3: formula> I'd like to actually see the formula used by the model.

Vlad
  • Can you `...%>% unnest(fun_call)` ? – AntoniosK Aug 20 '18 at 14:59
  • `models %>% unnest(fun_call)` Error: Each column must either be a list of vectors or a list of data frames [fun_call] - guess that means no... – Vlad Aug 20 '18 at 15:00
  • each `lm` model should have a `terms` element, which holds the formula used; if you call `as.character(lm$terms)` on it you might have something to work with. – RLave Aug 20 '18 at 15:13
  • since this does not give you exactly the right format, you need to rearrange it, something like: `paste(as.character(my_lm$terms)[2], as.character(my_lm$terms)[1], as.character(my_lm$terms)[-c(1:2)])`. If I understand your question correctly, this will give you a string with the correct formula. – RLave Aug 20 '18 at 15:14
  • Formulas aren't atomic types that a tibble can display in a table format. Are you just trying to look at it in the console? Then you should write your own print function. Or as someone else mentioned, convert the formula to a character value for display. – MrFlick Aug 20 '18 at 15:14

2 Answers

Based on RLave's comment, the fix is simply to pipe the list of formulas into as.character(), which turns the list column into a character column:

models <- df %>%
  mutate(
    model = map(data, best_model),
    rsq = map(model, broom::glance) %>% map_dbl("r.squared"),
    fun_call = map(model, formula) %>% as.character()
  )

which gives:

# A tibble: 142 x 5
   country     data              model      rsq fun_call  
   <fct>       <list>            <list>   <dbl> <chr>     
 1 Afghanistan <tibble [12 x 2]> <S3: lm> 0.949 y ~ log(x)
 2 Albania     <tibble [12 x 2]> <S3: lm> 0.912 y ~ log(x)
 3 Algeria     <tibble [12 x 2]> <S3: lm> 0.986 y ~ log(x)
 4 Angola      <tibble [12 x 2]> <S3: lm> 0.890 y ~ log(x)
 5 Argentina   <tibble [12 x 2]> <S3: lm> 0.996 y ~ x     
 6 Australia   <tibble [12 x 2]> <S3: lm> 0.983 log(y) ~ x
 7 Austria     <tibble [12 x 2]> <S3: lm> 0.994 log(y) ~ x
 8 Bahrain     <tibble [12 x 2]> <S3: lm> 0.968 y ~ log(x)
 9 Bangladesh  <tibble [12 x 2]> <S3: lm> 0.997 log(y) ~ x
10 Belgium     <tibble [12 x 2]> <S3: lm> 0.995 log(y) ~ x
# ... with 132 more rows
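
A slightly more explicit variant, offered only as a sketch: map_chr() with deparse() converts each formula on its own and fails loudly if the result is not a single string per model. The helper name formula_to_string and the toy mtcars models are assumptions made for illustration; they are not part of the original code.

```r
library(purrr)

# Hypothetical helper: turn a fitted model's formula into one string.
# deparse() may split a long formula across several strings, so collapse them.
formula_to_string <- function(model) {
  paste(deparse(formula(model)), collapse = " ")
}

# Toy models on a built-in dataset, for illustration only
fits <- list(
  lm(mpg ~ wt, data = mtcars),
  lm(log(mpg) ~ log(wt), data = mtcars)
)

# map_chr() guarantees a character vector with one entry per model
formula_strings <- map_chr(fits, formula_to_string)
formula_strings
```

In the pipeline above this would presumably read fun_call = map_chr(model, formula_to_string) inside mutate().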
Vlad

To make my comment clearer, I'll post it as an answer with an example. If I understood correctly, you want a column that holds the formula as a string, such as "y ~ x".

Suppose we have a simple lm:

x <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
y <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
my_lm <- lm(y ~ x)

The terms element of the model holds the formula, just not in the right order:

as.character(my_lm[["terms"]])
# [1] "~" "y" "x"

You just need to swap the first two elements:

parts <- as.character(my_lm$terms)   # "~" "y" "x"
paste(parts[2], parts[1], parts[-c(1:2)])
# [1] "y ~ x"

The resulting string can then be assigned to a column with mutate().
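
As a rough sketch of that last step, assuming the models sit in a list column called model: the helper terms_to_string and the two toy mtcars fits below are hypothetical and only stand in for the nested gapminder models from the question.

```r
library(dplyr)
library(purrr)

# Hypothetical helper wrapping the paste() rearrangement shown above
terms_to_string <- function(model) {
  parts <- as.character(model$terms)   # e.g. "~" "y" "x"
  paste(parts[2], parts[1], parts[3])
}

# Toy tibble with a list column of fitted models, for illustration only
fits <- tibble(
  id = c("a", "b"),
  model = list(
    lm(mpg ~ wt, data = mtcars),
    lm(log(mpg) ~ hp, data = mtcars)
  )
)

# map_chr() applies the helper to each model and returns a character column
fits <- fits %>%
  mutate(fun_call = map_chr(model, terms_to_string))

fits$fun_call
```

Because as.character() keeps the whole right-hand side as one element, the same helper should also cope with several predictors, such as y ~ x + z.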

RLave