How to compute regressions by group with lm, do, broom and dplyr?

Question

Consider this simple example:

> dataframe <- data_frame(id = c(1,2,3,4,5,6),
+                         group = c(1,1,1,2,2,2),
+                         value = c(200,400,120,300,100,100))
> dataframe
# A tibble: 6 x 3
     id group value
  <dbl> <dbl> <dbl>
1     1     1   200
2     2     1   400
3     3     1   120
4     4     2   300
5     5     2   100
6     6     2   100

Here I want to regress value on a constant, separately for each level of group. I have the get_mean() function:

get_mean <- function(data, myvar){
  col_name <- as.character(substitute(myvar))
  fmla <- as.formula(paste(col_name, "~ 1"))
  tidy(lm(data = data,fmla)) %>% pull(estimate)
}

The naive approach:

dataframe %>% group_by(group) %>% mutate(bug = get_mean(., value),
                                         Ineedthis = max(value))

# A tibble: 6 x 5
# Groups:   group [2]
     id group value      bug Ineedthis
  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1     1     1   200 203.3333       400
2     2     1   400 203.3333       400
3     3     1   120 203.3333       400
4     4     2   300 203.3333       300
5     5     2   100 203.3333       300
6     6     2   100 203.3333       300

This fails: as you can see, the mean is not computed by group. Every row gets 203.3333, which is the mean of the whole value column.
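
A likely explanation, sketched under the assumption that the pipe is magrittr's `%>%`: inside mutate(), the `.` placeholder stands for the entire data frame that was piped in, not for the current group, so get_mean() always sees all six rows. The small check below illustrates this; the column names rows_in_dot and rows_in_group are made up for the demonstration, and tibble() stands in for the deprecated data_frame().

```r
library(dplyr)

# tibble() stands in for the deprecated data_frame() used above
dataframe <- tibble(id = 1:6,
                    group = c(1, 1, 1, 2, 2, 2),
                    value = c(200, 400, 120, 300, 100, 100))

check <- dataframe %>%
  group_by(group) %>%
  mutate(rows_in_dot   = nrow(.),  # `.` is the whole piped data frame: 6 rows
         rows_in_group = n())      # n() respects the grouping: 3 rows each

check
```

rows_in_dot is 6 on every row while rows_in_group is 3, which matches the ungrouped mean seen above.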

It is well known that using do will work.

dataframe %>% group_by(group) %>% do(bug = get_mean(., value))
Source: local data frame [2 x 2]
Groups: <by row>

# A tibble: 2 x 2
  group       bug
* <dbl>    <list>
1     1 <dbl [1]>
2     2 <dbl [1]>

However, I don't know how to use do to also get the other Ineedthis variable, and I don't know how to unlist the bug variable. I want my output to be:

# A tibble: 6 x 5
     id group value     good Ineedthis
  <dbl> <dbl> <dbl>    <dbl>     <dbl>
1     1     1   200 240.0000       400
2     2     1   400 240.0000       400
3     3     1   120 240.0000       400
4     4     2   300 166.6667       300
5     5     2   100 166.6667       300
6     6     2   100 166.6667       300

Any ideas? Thanks!!

thanks @akrun, but how can I also get the `Ineedthis` variable? Do you have a working solution? thanks!! — ℕʘʘḆḽḘ, Aug 24 '17 at 15:00

score 2 · Answer 1 · answered Aug 24 '17 at 16:46

I made some changes to your get_mean function but it does functionally the same thing. See:

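# `.` receives the piped data but is unused; summarise() evaluates `myvar` per group
get_mean <- function(., myvar){
  dat <- data.frame(vec = myvar)
  # the intercept of an intercept-only regression is the group mean
  out <- unname(coef(lm(vec ~ 1, data = dat))[1])
  return(out)
}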

Allowing us to do:

dataframe %>%
  group_by(group) %>%
  summarise(good = get_mean(., value), Ineedthis= max(value)) %>%
  left_join(dataframe, ., by = 'group')

Resulting in:

  id group value     good Ineedthis
1  1     1   200 240.0000       400
2  2     1   400 240.0000       400
3  3     1   120 240.0000       400
4  4     2   300 166.6667       300
5  5     2   100 166.6667       300
6  6     2   100 166.6667       300

thanks @Zach but I need to keep the function as is because it is used elsewhere. Also, I think it is a good opportunity to use `do` here instead of summarize dont you think — ℕʘʘḆḽḘ, Aug 24 '17 at 16:48

ℕʘʘḆḽḘ · Accepted Answer · 2017-08-24T17:04:37.873

Here is a cool solution that reproduces the expected output. Not sure it's the best solution, but still worth sharing with my fellow coding lovers :)

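get_output <- function(dataframe){
  # one row per group: the regression mean and the group maximum
  temp <- dataframe %>%
    group_by(group) %>%
    do({
      mymean <- get_mean(., value)
      myother <- max(.$value)
      dplyr::data_frame(mean = mymean, other = myother)
    })
  # join the per-group results back onto the original rows
  dataframe %>% left_join(temp)
}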


> get_output(dataframe)
Joining, by = "group"
# A tibble: 6 x 5
     id group value     mean other
  <dbl> <dbl> <dbl>    <dbl> <dbl>
1     1     1   200 240.0000   400
2     2     1   400 240.0000   400
3     3     1   120 240.0000   400
4     4     2   300 166.6667   300
5     5     2   100 166.6667   300
6     6     2   100 166.6667   300
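
For what it's worth, the same per-group computation can probably be written without do() in newer dplyr. The sketch below assumes dplyr 0.8.0 or later (for group_modify()) and that broom is installed. It keeps get_mean() exactly as defined in the question; the names per_group and result are made up for illustration, and tibble() replaces the deprecated data_frame().

```r
library(dplyr)
library(broom)

dataframe <- tibble(id = 1:6,
                    group = c(1, 1, 1, 2, 2, 2),
                    value = c(200, 400, 120, 300, 100, 100))

# get_mean() as defined in the question: intercept of an intercept-only lm
get_mean <- function(data, myvar){
  col_name <- as.character(substitute(myvar))
  fmla <- as.formula(paste(col_name, "~ 1"))
  tidy(lm(data = data, fmla)) %>% pull(estimate)
}

# group_modify() calls the lambda once per group; .x holds that group's rows
per_group <- dataframe %>%
  group_by(group) %>%
  group_modify(~ tibble(mean  = get_mean(.x, value),
                        other = max(.x$value)))

# attach the per-group results to every original row
result <- left_join(dataframe, per_group, by = "group")
result
```

Passing by = "group" explicitly also silences the "Joining, by" message shown above.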
