
I'm new to the purrr paradigm and am struggling with it.

Following a few sources I have managed to get so far as to nest a data frame, run a linear model on the nested data, extract some coefficients from each lm, and generate a summary for each lm. The last thing I want to do is extract the "r.squared" from the summary (which I would have thought would be the simplest part of what I'm trying to achieve), but for whatever reason I can't get the syntax right.

Here's a MWE of what I have that works:

library(purrr)
library(dplyr)
library(tidyr)

mtcars %>%
  nest(-cyl) %>%
  mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
         sum = map(fit, ~summary))

and here's my attempt to extract the r.squared which fails:

mtcars %>%
  nest(-cyl) %>%
  mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
         sum = map(fit, ~summary),
         rsq = map_dbl(sum, "r.squared"))
Error in eval(substitute(expr), envir, enclos) : 
  `x` must be a vector (not a closure)

This is superficially similar to the example given on the RStudio site:

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")

This works; however, I would like the r.squared values to sit in a new column (hence the mutate statement), and I'd like to understand why my code isn't working rather than working around the problem.

EDIT:

Here's a working solution that I came to using the solutions below:

mtcars %>%
      nest(-cyl) %>% 
      mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
             summary = map(fit, glance),
             r_sq = map_dbl(summary, "r.squared"))

EDIT 2:

So, it actually turns out that the bug comes from the tilde in the sum = map(fit, ~summary) line. My guess is that the tilde makes each element of the list-column the summary function itself, rather than the object returned by calling summary on the fit. I would love an authoritative answer on this if someone wants to chime in.

To be clear, this version of the original code works fine:

mtcars %>%
  nest(-cyl) %>%
  mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
         summary = map(fit, summary),
         r_sq = map_dbl(summary, "r.squared"))
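
A minimal sketch of what the tilde seems to be doing, assuming purrr's usual formula shorthand: `~summary` becomes an anonymous function whose body is just the name `summary`, so each call returns the function itself instead of calling it on the fit. That would match the "not a closure" error above. The object names (`wrong`, `right_bare`, `right_lambda`) are only for illustration.

```r
library(purrr)

fit <- lm(mpg ~ wt, data = mtcars)

# `~summary` is a lambda that ignores its input and returns the function `summary`
wrong <- map(list(fit), ~summary)
is.function(wrong[[1]])

# Passing the bare function name applies it to each element
right_bare <- map(list(fit), summary)

# With the tilde, the argument has to be called explicitly via .x
right_lambda <- map(list(fit), ~ summary(.x))

class(right_bare[[1]])
```

Both of the last two forms give a `summary.lm` object, from which `map_dbl(..., "r.squared")` can extract a number.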
niklz

3 Answers


To fit in your current pipe, you'd want to use unnest along with map and glance from the broom package.

library(tidyr)
library(dplyr)
library(broom)

mtcars %>%
  nest(-cyl) %>%
  mutate(fit = map(data, ~lm(mpg ~ wt, data = .))) %>% 
  unnest(map(fit, glance))

You'll get more than just the r-squared, and from there you can use select to drop what you don't need.

If you want to keep the model summaries nested in list-columns:

mtcars %>%
  nest(-cyl) %>% 
  mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
         summary = map(fit, glance)) 

If you just want to extract a single value from a nested data frame, pass the name of that value to map_dbl (there is no need for [[ or extract2 as I originally suggested; many thanks for finding that out).

mtcars %>%
  nest(-cyl) %>% 
  mutate(fit = map(data, ~lm(mpg ~ wt, data = .)),
         summary = map(fit, glance),
         r_sq = map_dbl(summary, "r.squared"))
Jake Kaupp
  • Well, this does seem to be what I want to do; I'm just confused as to why the code is constructed this way. I don't understand why you unnest the data. Could you explain if you can? Thanks for the answer! – niklz Dec 02 '16 at 13:49
  • Using `unnest` takes the data frame out of the list column and spreads all available columns to the parent data frame. You can leave it nested but the r-squared column won't be directly accessible. I'll update the answer to have code without `unnest`. – Jake Kaupp Dec 02 '16 at 13:55
  • So the unnest is for the result of the map(fit, ~glance) statement, I thought it was unnesting the nested tibble (which is where I was getting confused). This method also circumvents the requirement to make the sum column with summaries, right? If I understand; the coeffs column in your second version would contain the same information (albeit in a different format). Is there no way I could have extracted the "r.squared" from the sum column though? Just I see myself hitting this wall again where I have a nested list and I want to grab out just one element from it. – niklz Dec 02 '16 at 14:18
  • You are correct. I've added the method I use to extract single columns out of a nested data frame in a list-column. I also cleaned up the code, having 2 summary maps was pointless, and could be done in one step with mapping `glance` to `fit`. – Jake Kaupp Dec 02 '16 at 14:44
  • Awesome! Actually I've modified it to get what I wanted originally. I've added it to the original question, this way makes the most sense to me. Thanks for all the help. – niklz Dec 02 '16 at 14:55
  • Weirdly, there was nothing wrong with the way I had written my map_dbl line; it was just not working with the object returned by summary. It works perfectly with glance. Seems a bit weird. – niklz Dec 02 '16 at 15:01
  • You're welcome, thanks for helping me streamline my own workflow now that `extract2` isn't needed. – Jake Kaupp Dec 02 '16 at 15:09
  • Update for tidyr versions 1.0.0 and above regarding the use of unnest in the first solution above from @jake-kaupp. According to the nest/unnest documentation, the ability to create a new variable in unnest has been deprecated. Instead the developer directs you to mutate first then unnest. So `mutate(fit = map(data, ~lm(mpg~wt, data = .))) %>% unnest(map(fit, glance))` becomes `mutate(fit = map(data, ~lm(mpg~wt, data = .)), fit = map(fit, glance)) %>% unnest(fit)` – ESELIA Jan 21 '22 at 15:12
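
For reference, a sketch of the mutate-then-unnest pattern described in the comment above, assuming tidyr 1.0.0 or later. The `nest(data = -cyl)` spelling and the column name `glanced` are my own choices here, not something taken from the answer.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

res <- mtcars %>%
  nest(data = -cyl) %>%                                 # one row per cyl, data in a list-column
  mutate(fit = map(data, ~ lm(mpg ~ wt, data = .x)),    # fit one model per group
         glanced = map(fit, glance)) %>%                # one-row summary tibble per model
  unnest(glanced)                                       # spread the glance columns into res

select(res, cyl, r.squared)
```

Keeping `fit` and the glance output in separate columns means the fitted models stay available after the unnest.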

I think for what you'd like to achieve, you are better off using the glance() function from the broom package:

library(broom)
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  do(glance(lm(mpg ~ wt, data = .))) %>%
  select(cyl, r.squared)
#    cyl r.squared
#  <dbl>     <dbl>
#1     4 0.5086326
#2     6 0.4645102
#3     8 0.4229655
mtoto
  • This does get the desired output, but (sorry for being picky) I'd really like to find an implementation that works in the current pipe that I have. I'm sure there's a way and it's just a case of getting the right syntax. Thanks for the answer – niklz Dec 02 '16 at 11:12
  • If all you want is the results of the lm model, this is a simpler answer. However @jake-kaupp 's solution retains the original variables and the model in the solution, which could be useful for certain situations, as in returning an output from a user defined function. – ESELIA Jan 21 '22 at 15:17

There must be a better way, here is my try with pipes:

mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared") %>% 
  list() %>% 
  as.data.frame(col.names = "r.squared") %>% 
  add_rownames(var = "cyl")

# # A tibble: 3 × 2
#     cyl r.squared
#   <chr>     <dbl>
# 1     4 0.5086326
# 2     6 0.4645102
# 3     8 0.4229655

Note: you might get the warning below.

Warning message: Deprecated, use tibble::rownames_to_column() instead.
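
One possible way to avoid the deprecated call, sketched here under the assumption that the tibble package is available: `tibble::enframe()` turns the named vector returned by `map_dbl` straight into a two-column tibble. The names `r2` and `r2_tbl` are just placeholders.

```r
library(purrr)
library(tibble)

# Named numeric vector of r-squared values, one per cyl group
r2 <- split(mtcars, mtcars$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")

# Names become the "cyl" column, values become "r.squared"
r2_tbl <- enframe(r2, name = "cyl", value = "r.squared")
r2_tbl
```

As in the output above, `cyl` comes out as a character column because it is built from the vector's names.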

zx8754
  • Thanks, there is indeed a better way; check my edit on the OP – niklz Dec 02 '16 at 15:03
  • @zx8754 I have a hard time understanding why `map_dbl("r.squared")` works in this example. I mean, `"r.squared"` is not a function, so how exactly is this extraction made or applied? Could you clarify? :) – stats-hb May 06 '17 at 14:16
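
On the question in the last comment: as far as I understand purrr, when the second argument to a `map` function is a character string rather than a function, it is converted into an extractor that pulls the element with that name out of each list item. A small sketch comparing it with explicit equivalents (the names `by_name`, `by_lambda` and `by_vapply` are illustrative only):

```r
library(purrr)

sums <- split(mtcars, mtcars$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary)

# String shorthand: extract the element named "r.squared" from each summary
by_name <- map_dbl(sums, "r.squared")

# The same extraction written out as a lambda
by_lambda <- map_dbl(sums, ~ .x[["r.squared"]])

# Base R version without purrr's shorthand
by_vapply <- vapply(sums, function(s) s$r.squared, numeric(1))

all.equal(by_name, by_lambda)
```

This works because a `summary.lm` object is a list with a named `r.squared` element.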