Computing difference between averages by group and logical values using dplyr

Question

Does anyone know of a way to use dplyr to compute the difference between averages for some_var == TRUE and some_var == FALSE, grouped by a third variable?

For example, given the following example dataframe:

library('dplyr')

dat <- iris %>% 
     mutate(wide=Sepal.Width > 3) %>% 
     group_by(Species, wide) %>% 
     summarize(mean_width=mean(Sepal.Width))

dat

# A tibble: 6 x 3
# Groups:   Species [?]
     Species  wide mean_width
      <fctr> <lgl>      <dbl>
1     setosa FALSE   2.900000
2     setosa  TRUE   3.528571
3 versicolor FALSE   2.688095
4 versicolor  TRUE   3.200000
5  virginica FALSE   2.800000
6  virginica  TRUE   3.311765

Does anyone know of a way to derive a new data frame with the differences for wide == TRUE and wide == FALSE, by Species?

This can be done using several statements:

false_vals <- dat %>% filter(wide==FALSE)
true_vals <- dat %>% filter(wide==TRUE)

diff <- data.frame(Species=unique(dat$Species), diff=true_vals$mean_width - false_vals$mean_width)

> diff
     Species      diff
1     setosa 0.6285714
2 versicolor 0.5119048
3  virginica 0.5117647

However, this seems like something that should be achievable directly with dplyr.

Any ideas?

How about using `spread` to put it into two columns? – akrun Dec 09 '17 at 15:43

score 5 · Accepted Answer · answered Dec 09 '17 at 15:45

Using spread() from the tidyr package:

library(tidyr)

iris %>% mutate(wide=Sepal.Width > 3) %>% 
        group_by(Species, wide) %>% 
        summarize(mean_width=mean(Sepal.Width)) %>%
        spread(wide, mean_width) %>%
        summarise(diff = `TRUE` - `FALSE`)
#     Species      diff
#1     setosa 0.6285714
#2 versicolor 0.5119048
#3  virginica 0.5117647

mtoto


Perfect! Thanks for the quick solution! – Keith Hughitt Dec 09 '17 at 15:51
P.S. Any suggestions for a clearer title for the question? This was the best I could do to describe the problem. – Keith Hughitt Dec 09 '17 at 15:52

Facu · Answer 2 · 2020-04-23T10:43:10.737

In newer versions of the tidyr package (>= 1.0.0), it is better to use pivot_wider() instead of spread(). It is more intuitive, and spread() is now superseded, so it may be deprecated in the future.

library(tidyr)

iris %>% mutate(wide=Sepal.Width > 3) %>% 
        group_by(Species, wide) %>% 
        summarize(mean_width=mean(Sepal.Width)) %>%
        pivot_wider(names_from = wide, values_from = mean_width) %>%
        summarise(diff = `TRUE` - `FALSE`)

#     Species      diff
#1     setosa 0.6285714
#2 versicolor 0.5119048
#3  virginica 0.5117647
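
As a possible alternative that avoids reshaping altogether, the difference can also be computed in a single summarise() call by subsetting the column with the logical vector inside each group. This is only a minimal sketch and not part of the original answers; the name diff_by_species is made up for illustration.

```r
library(dplyr)

# `diff_by_species` is a hypothetical name, not taken from the thread.
diff_by_species <- iris %>%
  mutate(wide = Sepal.Width > 3) %>%   # logical flag, as in the question
  group_by(Species) %>%                # group by Species only, not by wide
  summarise(
    # mean over rows where wide is TRUE minus mean where it is FALSE
    diff = mean(Sepal.Width[wide]) - mean(Sepal.Width[!wide])
  )

diff_by_species
```

This should give the same three differences as the reshaping approaches. One caveat: if a group has only TRUE or only FALSE values of wide, one of the means is taken over an empty vector and the result for that group is NaN.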
