Subtracting cell in one row from cell in another row when summarizing grouped data with dplyr?

Question

Background: I have data from a simulation where I have a few variables and thus many resulting combinations of parameters. Due to the internal design of the simulation there can be a little variation among the outcomes of identical sets of parameters, so I run a number of identical runs, then calculate their min, max, and mean score. Then, I want to compare the treatment and no-treatment conditions:

calculate the mean of treatment minus no-treatment
calculate the difference of the min score of treatment minus max score of no-treatment
calculate the difference of the max score of treatment minus min score of no-treatment

This gives me the mean difference but also the bounds of the best- and worst-case comparison.

Example data:

library(tibble)  # provides tribble()

my_data <- tribble(
  ~params, ~treatment, ~mean_score, ~min_score,  ~max_score,
  "combo a", 0, 91,  90, 92,
  "combo a", 1, 92,  92, 92,
  "combo b", 0, 89,  87, 91,
  "combo b", 1, 92,  89, 92,
  "combo c", 0, 90,  90, 90,
  "combo c", 1, 89,  85, 93,
)

Blowing the dust off my R skills, my initial attempt is the following, but I do not know how to tell summarize which row should be subtracted from which within the grouping.

Code attempt I know doesn't work:

my_summ_data <- my_data %>%
  dplyr::group_by(params = as.factor(params)) %>%
  dplyr::summarize(hier_diff=diff(mean_score), 
                   min_max_diff=diff(c(min_score, max_score)),
                   max_min_diff=diff(c(max_score, min_score)) )

I would like to get

params	hier_diff	min_max_diff	max_min_diff
combo a	1	0	2
combo b	3	-2	5
combo c	-1	-5	3

but instead I get (btw I don't yet understand why I get these extra rows)

params	hier_diff	min_max_diff	max_min_diff
combo a	1	2	0
combo a	1	0	-2
combo a	1	0	2
combo b	1	2	0
combo b	1	2	-4
combo b	1	0	2
combo c	2	-2	6
combo c	2	2	-6
combo c	2	6	-2

I'm not convinced there is a sensible way to do what I want using summarize. But if there is, I would like to know it, and if not, what is the next best alternative?

lovalery · Answer 1 · 2022-02-14T23:05:38.420

1

Please find below one possible solution.

Reprex

Code

library(dplyr)
library(tibble)


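my_summ_data <- my_data %>%
  dplyr::group_by(params) %>%
  dplyr::arrange(treatment) %>%  # sort so the treatment == 0 row comes first in each group
  dplyr::summarize(hier_diff = diff(mean_score),                         # mean(1) - mean(0)
                   min_max_diff = diff(c(max_score[1], min_score[2])),  # min(1) - max(0)
                   max_min_diff = diff(c(min_score[1], max_score[2])))  # max(1) - min(0)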

Output

my_summ_data
#> # A tibble: 3 x 4
#>   params  hier_diff min_max_diff max_min_diff
#>   <chr>       <dbl>        <dbl>        <dbl>
#> 1 combo a         1            0            2
#> 2 combo b         3           -2            5
#> 3 combo c        -1           -5            3

Created on 2022-02-14 by the reprex package (v2.0.1)
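On the extra rows mentioned in the question: a likely explanation is that `diff()` applied to a vector of length n returns n - 1 values. Within a two-row group, `diff(c(min_score, max_score))` therefore sees four numbers and returns three differences, and `summarize()` then emits one row per returned value. A minimal base R sketch using the combo a values from the example data (the names `extra` and `single` are only for illustration):

```r
# Values for "combo a" from the example data
# (first element: treatment 0, second element: treatment 1)
min_score <- c(90, 92)
max_score <- c(92, 92)

# Concatenating two length-2 columns gives a length-4 vector,
# and diff() on a length-4 vector returns 3 successive differences.
extra <- diff(c(min_score, max_score))
print(extra)

# Picking one element from each column gives a length-2 vector,
# so diff() returns exactly one value: min(treatment 1) - max(treatment 0).
single <- diff(c(max_score[1], min_score[2]))
print(single)
```

The three values in `extra` (2, 0, 0) are the ones that appear in the `min_max_diff` column for combo a in the unwanted output, while `single` is the one value per group that the indexed version returns.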

edited Feb 14 '22 at 23:05

answered Feb 14 '22 at 22:22

lovalery

This works with the example, but it works with an assumption, I think. Indexing the rows like this requires the dataframe to always have the rows in order, right? Meaning if we modified the example tibble to have some rows where `treatment` = 1 came before `treatment` = 0 in the same `params` group, we'd get a different answer. How can we ensure treatment1 - treatment0 regardless of the row order? – Stan Rhodes Feb 14 '22 at 23:03
Hi @Stan Rhodes, OK: to make sure that the rows are always in the same order, you can add `dplyr::arrange(treatment)` (cf. my edit above), and then it will always work correctly. Cheers. – lovalery Feb 14 '22 at 23:07
I accepted Jon Spring's answer as the official answer because it's slightly simpler without the `arrange()` and makes the target values for the variables explicit. But, of course, I still think this is a valuable contribution! – Stan Rhodes Feb 17 '22 at 19:58
@Stan Rhodes, thank you very much for your feedback and the justification of the choice for the validation of the answer. Cheers. – lovalery Feb 18 '22 at 11:33

Jon Spring · Accepted Answer · 2022-02-17T19:50:23.870

1

library(dplyr)  # provides %>% and the dplyr verbs used below

my_data %>%
  dplyr::group_by(params = as.factor(params)) %>%
  dplyr::summarize(
    hier_diff    = mean_score[treatment == 1] - mean_score[treatment == 0],
    min_max_diff = min_score[treatment == 1] - max_score[treatment == 0],   # EDIT -- removed unneeded min/max
    max_min_diff = max_score[treatment == 1] - min_score[treatment == 0]    # EDIT -- removed unneeded min/max
  )

Result

# A tibble: 3 x 4
  params  hier_diff min_max_diff max_min_diff
  <fct>       <dbl>        <dbl>        <dbl>
1 combo a         1            0            2
2 combo b         3           -2            5
3 combo c        -1           -5            3

Note that the answer is the same even if the treatment rows appear before the no-treatment rows, e.g.:

my_data <- tribble(
  ~params, ~treatment, ~mean_score, ~min_score,  ~max_score,
  "combo a", 1, 92,  92, 92,  # swapped rows 1+2, 3+4, 5+6
  "combo a", 0, 91,  90, 92,
  "combo b", 1, 92,  89, 92,
  "combo b", 0, 89,  87, 91,
  "combo c", 1, 89,  85, 93,
  "combo c", 0, 90,  90, 90,
)
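
One way to check this claim is to run the same summary on both row orders and compare the results. The sketch below assumes dplyr and tibble are installed; `summarise_diffs()`, `original`, and `swapped` are names made up for this illustration and are not part of the answer.

```r
library(dplyr)
library(tibble)

# Hypothetical helper wrapping the summary used in this answer
summarise_diffs <- function(df) {
  df %>%
    group_by(params = as.factor(params)) %>%
    summarize(
      hier_diff    = mean_score[treatment == 1] - mean_score[treatment == 0],
      min_max_diff = min_score[treatment == 1] - max_score[treatment == 0],
      max_min_diff = max_score[treatment == 1] - min_score[treatment == 0],
      .groups = "drop"
    )
}

# Example data from the question (no-treatment rows first)
original <- tribble(
  ~params,   ~treatment, ~mean_score, ~min_score, ~max_score,
  "combo a", 0,          91,          90,         92,
  "combo a", 1,          92,          92,         92,
  "combo b", 0,          89,          87,         91,
  "combo b", 1,          92,          89,         92,
  "combo c", 0,          90,          90,         90,
  "combo c", 1,          89,          85,         93
)

# Same rows, but with the treatment row first within each params group
swapped <- original %>% arrange(params, desc(treatment))

res_original <- summarise_diffs(original)
res_swapped  <- summarise_diffs(swapped)

# TRUE if both row orders give the same summary
same <- isTRUE(all.equal(as.data.frame(res_original), as.data.frame(res_swapped)))
print(same)
```

Because each value is selected by `treatment == 1` or `treatment == 0` rather than by position, the comparison should not depend on how the rows are sorted.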

edited Feb 17 '22 at 19:50

answered Feb 14 '22 at 23:18

Jon Spring

Hi Jon Spring, it seems to me that your answer works only imperfectly because if you reverse the order of the treatments for one of the groups (see the comment made by @Stan Rhodes under my answer), the value for `hier_diff` will not be the right one. As in the case of my answer, I think you need to add `dplyr::arrange(treatment)` for your code to work in any case. Cheers. – lovalery Feb 14 '22 at 23:31
No, I get the same answer regardless of row order, because the comparisons are based on the values in `treatment` and not their order of appearance. – Jon Spring Feb 14 '22 at 23:53
Yes, I understand for `min_max_diff` and `max_min_diff` but that is not true for `hier_diff=diff(mean_score)` : using your code and your data, you get `-3` instead of `3` for `combo b`. Does this make sense? Or am I wrong? – lovalery Feb 14 '22 at 23:57
Thank you for pointing that out - fixed. – Jon Spring Feb 14 '22 at 23:59
@JonSpring I notice you're using the `min` and `max` functions where there is only one value. It seems like they're not doing any work. Is there a reason for this I'm not seeing? – Stan Rhodes Feb 17 '22 at 19:46
I think you're right, thanks. I've updated to take those out since (at least in the example data) they're not needed. – Jon Spring Feb 17 '22 at 19:50
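
Outside the example data, the logical indexing in this answer relies on each `params` group having exactly one row with `treatment == 1` and exactly one with `treatment == 0`. A hedged sketch of a pre-check follows, assuming dplyr and tibble are installed; `incomplete`, `n_treated`, `n_control`, and `bad_groups` are illustrative names and the data is made up, not taken from the original post.

```r
library(dplyr)
library(tibble)

# Made-up data: "combo b" is missing its no-treatment row
incomplete <- tribble(
  ~params,   ~treatment, ~mean_score,
  "combo a", 0,          91,
  "combo a", 1,          92,
  "combo b", 1,          92
)

# Count treatment and no-treatment rows per group,
# then keep only groups that do not have exactly one of each
bad_groups <- incomplete %>%
  group_by(params) %>%
  summarize(
    n_treated = sum(treatment == 1),
    n_control = sum(treatment == 0),
    .groups = "drop"
  ) %>%
  filter(n_treated != 1 | n_control != 1)

print(bad_groups)
```

If `bad_groups` has any rows, those groups need attention before the subtraction is meaningful.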
