Consider this simple example:

library(dplyr)
library(broom)

dataframe <- tibble(id = c(1,2,3,4,5,6),
                    group = c(1,1,1,2,2,2),
                    value = c(200,400,120,300,100,100))

# A tibble: 6 x 3
     id group value
  <dbl> <dbl> <dbl>
1     1     1   200
2     2     1   400
3     3     1   120
4     4     2   300
5     5     2   100
6     6     2   100

I want to group by the group column and add two new columns.

The first is the number of distinct values in value (I can use dplyr::n_distinct for that). The second is the constant term from a regression of value on the vector 1, that is, the output of

tidy(lm(data = dataframe, value ~ 1)) %>% select(estimate)

 estimate
1 203.3333

The difficulty here is combining these two simple outputs into a single mutate statement that preserves the grouping.

I tried something like:

formula1 <- function(data, myvar){
tidy(lm(data = data, myvar ~ 1)) %>% select(estimate)
}

dataframe %>% group_by(group) %>% 
  mutate(distinct = n_distinct(value),
         mean = formula1(., value))

but this does not work. What am I missing here? Thanks!

ℕʘʘḆḽḘ
  • You do realize "the constant term from a regression of value on the vector 1" will just be the mean of those values, right? I assume your actual application is more complicated? – MrFlick Aug 23 '17 at 17:43
  • yes, of course. I actually plan on getting more statistics from the `lm` output. but this is the simplest example I thought of – ℕʘʘḆḽḘ Aug 23 '17 at 17:45
  • Have you had a closer look at the output of an lm model? For example, if you run `mod <- lm(mpg ~ cyl + hp, mtcars)`, check `str(mod)`. You'll find it's a named list and you just have to extract the desired parts. – talat Aug 23 '17 at 17:51
  • @docendodiscimus basically it's even simpler to use `broom`. but as you see in my code, this does not work as expected. – ℕʘʘḆḽḘ Aug 23 '17 at 17:54
  • I think you need `pull` in place of `select` if you want to get a single value from the `tidy` output. – aosmith Aug 23 '17 at 18:14
  • damn! it works! thanks @aosmith. do you mind posting a solution then? – ℕʘʘḆḽḘ Aug 23 '17 at 18:21

1 Answer

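This approach will work if you use pull in place of select. select returns a one-column tibble, whereas pull extracts the single estimate value from the tidy output as a plain numeric vector, which is what mutate needs to fill an ordinary column.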
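To make the difference concrete, here is a minimal sketch on the ungrouped data from the question. The names fit, est_df, and est_vec are illustrative only and do not appear in the original post.

```r
library(dplyr)
library(broom)

dataframe <- tibble(id = c(1, 2, 3, 4, 5, 6),
                    group = c(1, 1, 1, 2, 2, 2),
                    value = c(200, 400, 120, 300, 100, 100))

# Intercept-only model on all six rows, tidied into one row per term.
fit <- tidy(lm(value ~ 1, data = dataframe))

est_df  <- fit %>% select(estimate)  # still a tibble (1 row, 1 column)
est_vec <- fit %>% pull(estimate)    # a bare numeric vector of length 1

is.data.frame(est_df)   # select keeps the data frame wrapper
is.numeric(est_vec)     # pull drops it
length(est_vec)
```

The same substitution is what lets the helper below return a value that mutate can store directly.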

formula1 <- function(data, myvar){
     tidy(lm(data = data, myvar ~ 1)) %>% pull(estimate)
}

dataframe %>% 
     group_by(group) %>% 
     mutate(distinct = n_distinct(value),
            mean = formula1(., value))

# A tibble: 6 x 5
# Groups:   group [2]
     id group value distinct     mean
  <dbl> <dbl> <dbl>    <int>    <dbl>
1     1     1   200        3 240.0000
2     2     1   400        3 240.0000
3     3     1   120        3 240.0000
4     4     2   300        2 166.6667
5     5     2   100        2 166.6667
6     6     2   100        2 166.6667
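
One detail worth keeping in mind: inside a grouped mutate, the . placeholder still appears to stand for the whole data frame coming through the pipe, not the current group. The per-group result comes from value, which mutate evaluates group by group and which lm then finds as myvar. As a sketch under that reading, a helper that takes only the vector gives the same per-group estimates. The name intercept_of is hypothetical, not part of dplyr or broom, and base R's ave() is used only as a cross-check against the group means.

```r
library(dplyr)
library(broom)

dataframe <- tibble(id = c(1, 2, 3, 4, 5, 6),
                    group = c(1, 1, 1, 2, 2, 2),
                    value = c(200, 400, 120, 300, 100, 100))

# Hypothetical helper: fit an intercept-only model to a numeric vector
# and return the estimate as a length-one numeric.
intercept_of <- function(x) {
  tidy(lm(x ~ 1)) %>% pull(estimate)
}

result <- dataframe %>%
  group_by(group) %>%
  mutate(distinct = n_distinct(value),
         mean = intercept_of(value)) %>%
  ungroup()

# Cross-check: ave() returns the group mean for every row by default.
all.equal(result$mean, ave(dataframe$value, dataframe$group))
```

Dropping the data argument removes the unused data frame from the call, which may make it easier to extend the helper to other statistics from the lm output.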
aosmith