I'm trying to create one histogram per group and then return a summary. Per this answer, I can use {braces} and print() so that the plot is drawn before the pipe moves on to the next step; however, this doesn't seem to acknowledge the grouping:

library(dplyr)
library(magrittr)
library(ggplot2)

data(mtcars)
mtcars |> 
  group_by(cyl) %T>%
  {print(ggplot(.) +
           geom_histogram(aes(x = carb)))} |> 
  summarise(meancarb = mean(carb))

The above code works insofar as it creates a single histogram then the summary, however:

mtcars %T>%
  {print(ggplot(.) +
           geom_histogram(aes(x = carb)))} |> 
  group_by(cyl) |> 
  summarise(meancarb = mean(carb))

The above code produces exactly the same output, which confirms that group_by isn't being acknowledged.

Does anyone know why the grouping isn't being used to create 1 histogram per unique cyl? Ideally I'd love to work out how to use Tee pipes to do this kinda thing more often, including saving the output to unique names, before continuing onto more pipe. In general it feels like Tee pipes are underused, possibly relating to the dearth of info about them, so if anyone has any cool examples to share, that might be great for the community.

Thanks!

Edit

Following divibisan's comment about dplyr::group_map (or group_walk):

mtcars |> 
  group_by(cyl) %T>%
  group_walk(.f = ~ ggplot(.) +
              geom_histogram(aes(x = carb))) |> 
  summarise(meancarb = mean(carb, na.rm = TRUE),
            sd3 = sd(carb, na.rm = TRUE) * 3)

This creates the summary table but no plot(s). The output is identical for group_map and group_walk, and also the same if I replace %T>% with |>, so group_walk appears to be doing the same job as %T>%. With |> and group_map, I get:

Error in UseMethod("summarise"): no applicable method for 'summarise' applied to an object of class "list"

mtcars |> 
  group_by(cyl) %T>%
  {print(group_walk(.f = ~ ggplot(.) +
              geom_histogram(aes(x = carb))))} |> 
  summarise(meancarb = mean(carb, na.rm = TRUE),
            sd3 = sd(carb, na.rm = TRUE) * 3)

With print and braces:

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'print': argument ".data" is missing, with no default

Braces no print:

Error in group_map(.data, .f, ..., .keep = .keep): argument ".data" is missing, with no default

Print no braces: same as braces no print.

Edit2

More interesting ideas coming forth, thanks to Ricardo, this:

mtcars |> 
  group_split(cyl) |> 
  map(.f = ~ ggplot(.) +
        geom_histogram(aes(x = carb)))

Works insofar as it produces 1 plot per group. Success! But: I can't find any combination of Tee/pipes which Tees off mtcars for the group_split AND map, and then resumes the main pipe line:

mtcars %T>% 
  group_split(cyl) %T>%
  map(.f = ~ ggplot(.) +
               geom_histogram(aes(x = carb))) |>
  summarise(meancarb = mean(carb))

Error in map(): In index: 1. With name: mpg. Caused by error in fortify(): data must be a <data.frame>, or an object coercible by fortify(), not a double vector.

Also anything other than 2 pipes means the plots aren't created.

Trying this another way around, by reordering the pipe structure (which won't always be possible/desirable):

mtcars |>
  group_by(cyl) %T>%
  summarise(meancarb = mean(carb)) |> 
  ungroup() |> 
  group_split(cyl) |> 
  map(.f = ~ ggplot(.) +
        geom_histogram(aes(x = carb)))

This creates the 3 plots but doesn't print the summary. Any combination of {braces} and/or print around the summary line gives:

Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mean': object 'carb' not found.

Does anyone know whether the Tee pipe is explicitly for a single command, i.e. you can't pipe another command onto the tee branch, and then return to the main pipe? Thanks all

Edit 3

Thanks zephyr. Followup question: how to do multi-command tee pipes without a formula-format first command?

mtcars |>
  summarise(sdd = sd(carb, na.rm = TRUE))

Works fine, prints a single value.

mtcars %T>%
  summarise(sdd = sd(carb, na.rm = TRUE)) |> 
  summarise(
    meancarb = mean(carb, na.rm = TRUE),
    sd3 = sd(carb, na.rm = TRUE) * 3
  )

Doesn't print the value, performs the calculation invisibly then continues. Any combination of print and {braces} I've tried results in:

Error: function '{' not supported in RHS call of a pipe

or

Error in is.data.frame(x) : object 'carb' not found

Say I wanted, e.g.:

mtcars  |> 
  summarise(~{
    print(sdd = sd(carb))
    write_csv(file = "tmp.csv")
    .x
  }) |> 
  summarise(meancarb = mean(carb))

Any thoughts? Thanks again!

dez93_2000
  • Maybe I'm missing something, but I don't think ggplot makes multiple plots when passed a grouped data frame. I think you'd need to put a [group_map](https://stackoverflow.com/a/31432029/8366499) inside the braces to make multiple plots, or facet the plot by `cyl` – divibisan Jun 09 '23 at 22:31
  • Thanks for the info, these are new to me. Updated the question after trying out a few variants. Can defo fallback to facet_plot but would be great to see if we can get this working. – dez93_2000 Jun 09 '23 at 23:05
  • Maybe `group_split` into `map` instead of `group_by` – Ricardo Semião e Castro Jun 09 '23 at 23:08
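
For reference, divibisan's faceting suggestion might look something like the sketch below. It is not from the thread itself: it draws a single figure with one panel per `cyl` value rather than separate plots, and `binwidth = 1` is an arbitrary choice made here only to silence the default-bins message.

```r
library(dplyr)
library(ggplot2)

# One figure, one panel per cyl value; ggplot ignores dplyr grouping,
# so the split has to be requested through facet_wrap()
p <- ggplot(mtcars, aes(x = carb)) +
  geom_histogram(binwidth = 1) +  # binwidth is an assumption, not from the question
  facet_wrap(~ cyl)
print(p)

# The summary is then an ordinary, separate pipeline
summary_tbl <- mtcars |>
  group_by(cyl) |>
  summarise(meancarb = mean(carb))
summary_tbl
```

This sidesteps the tee pipe entirely, which is probably the simplest route when separate plot objects aren't needed.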

1 Answer

You were on the right track with group_walk(), but you need to put the print() inside the mapped function:

library(dplyr)
library(purrr)
library(magrittr)
library(ggplot2)

mtcars |> 
  group_by(cyl) %T>%
  group_walk(~ print(
    ggplot(.) + geom_histogram(aes(x = carb))
  )) |> 
  summarise(
    meancarb = mean(carb, na.rm = TRUE),
    sd3 = sd(carb, na.rm = TRUE) * 3
  )
# A tibble: 3 × 3
    cyl meancarb   sd3
  <dbl>    <dbl> <dbl>
1     4     1.55  1.57
2     6     3.43  5.44
3     8     3.5   4.67
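
The question also asks about keeping each group's plot under a unique name. One possible sketch, which is not part of the original answer, is to collect the plots with group_map() and name them from the group keys. The `hist_cyl_*` naming scheme and `binwidth = 1` are made up for illustration.

```r
library(dplyr)
library(ggplot2)

grouped <- mtcars |> group_by(cyl)

# group_map() returns a list with one element per group, in group order;
# .x is the group's data and .y is a one-row tibble holding the group key
plots <- grouped |>
  group_map(~ ggplot(.x, aes(x = carb)) +
              geom_histogram(binwidth = 1) +
              ggtitle(paste("cyl =", .y$cyl)))

# Name each plot after its group key (hypothetical naming scheme)
names(plots) <- paste0("hist_cyl_", group_keys(grouped)$cyl)

# Print a single plot by name
print(plots$hist_cyl_4)

# The summary is computed from the same grouped data
summary_tbl <- grouped |> summarise(meancarb = mean(carb))
summary_tbl
```

Because the plots live in an ordinary named list, they can be printed, saved, or arranged later without re-running the pipeline.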

Note you can get the same result without using %T>% by assigning the plot to a name in your anonymous function and returning the original dataframe after printing:

mtcars |> 
  group_by(cyl) |>
  group_walk(~ {
    p <- ggplot(.x) + geom_histogram(aes(x = carb))
    print(p)
    .x
  }) |> 
  summarise(
    meancarb = mean(carb, na.rm = TRUE),
    sd3 = sd(carb, na.rm = TRUE) * 3
  )
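
A small check, not part of the original answer: according to the dplyr documentation, group_walk() calls the function for its side effects and returns its input invisibly. That would explain why the pipeline above can carry on into summarise() with a plain |>. The sketch below uses cat() as a stand-in for the plotting side effect.

```r
library(dplyr)

grouped <- mtcars |> group_by(cyl)

# The function returns NULL here, yet group_walk() still hands back its input
out <- grouped |>
  group_walk(~ cat("rows in this group:", nrow(.x), "\n"))

inherits(out, "grouped_df")  # a grouped tibble, not a list of results
group_vars(out)              # the grouping by cyl is preserved

# So the main pipeline can simply continue
out |> summarise(meancarb = mean(carb))
```

group_map(), by contrast, returns the list of results, which matches the "no applicable method for 'summarise'" error reported in the question.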
zephryl
  • Thanks for this, wonderful stuff. So in e.g.1, `~print()` prints the outcome of its contents, a `ggplot` chain, and does this as a tee'd branch thanks to it being within `group_walk` within a `%T>%`; in e.g.2, `{`bracing`}` the `group_walk` formula evaluates its contents immediately, and including `.x` returns the unchanged `tbl`, giving the same outcome as `%T>%`? Wild stuff! Any reason to assign then print p vs `print(ggplot ... carb)))` ? Thanks so much for this. Edit: the second formulation seems more flexible & doesn't need magrittr, any reason not to use instead of `%T>%` always? – dez93_2000 Jun 12 '23 at 17:32
  • For e.g.1, pretty much yes -- the reason `print(group_walk())` doesn't work is `group_walk()` doesn't return anything, so there's nothing to print. You could also have done `{print(group_map(., ~ ggplot(.x) + geom_histogram(aes(x = carb))))}`, which would print the list of plots returned by `group_map()`. For e.g.2, the purpose of bracing is to include a multi-line function as an argument to `group_walk()`. – zephryl Jun 12 '23 at 17:58
  • re "any reason to assign then print p vs print(ggplot ... carb))) ?" No, you're right -- the key thing is to print the plot then return `.x`, but you don't have to assign the plot before printing. – zephryl Jun 12 '23 at 18:02
  • And to your last question, `%T>%` can be useful in simpler cases -- e.g. if you wanted to save versions of a dataframe at different steps in your pipeline, you could do `dat %T>% write.csv("all_data.csv") %>% summarize(mean_x = mean(x)) ...`. But in practice I rarely find it useful. – zephryl Jun 12 '23 at 18:09
  • Followup: how would one generalise either approach for multi-command tee branches where the first command isn't formula style? e.g. added to question for formatting. If I wanted to summarise sdd, print that out (default is invisible), save to csv, then return to the main pipe? Thanks again. – dez93_2000 Jun 12 '23 at 19:28
  • You could do `mtcars %T>% { summarise(., sdd = sd(carb)) |> print() |> write_csv("tmp.csv") } |> summarise(meancarb = mean(carb))`, putting everything you want to "skip over" in braces. This definitely isn't idiomatic, though -- in most cases I would just make two separate pipelines. – zephryl Jun 12 '23 at 20:17
  • Lovely stuff, very much appreciated. Looks like I missed the `.,`; I guess this is needed due to the braces? Typically `dplyr`/`magrittr` pipe chains automatically/invisibly populate the `data` parameter... – dez93_2000 Jun 12 '23 at 21:07
  • Yes - but enclosing in braces overrides this. See the docs for `%>%`, particularly [this section](https://magrittr.tidyverse.org/reference/pipe.html#using-lambda-expressions-with-gt-) and [this one](https://magrittr.tidyverse.org/reference/pipe.html#using-the-dot-for-secondary-purposes). – zephryl Jun 13 '23 at 01:32
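
Spelled out over several lines, the braces approach from the comments might look like the sketch below. It is an illustration rather than anyone's recommended method: it swaps in base `write.csv()` with a temporary file so that it runs without readr, and `tmp` and `side` are names invented here.

```r
library(dplyr)
library(magrittr)

tmp <- tempfile(fileext = ".csv")  # hypothetical output path

result <- mtcars %T>%
  {
    # Inside braces the piped value is not inserted automatically,
    # so it has to be passed explicitly as `.`
    side <- summarise(., sdd = sd(carb))
    print(side)
    write.csv(side, tmp, row.names = FALSE)
  } |>
  summarise(meancarb = mean(carb))

result
```

`%T>%` discards whatever the braced block returns and passes the original `mtcars` on, so the final summarise() still sees the full data. As zephryl notes, two separate pipelines are usually easier to read.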