How to combine dplyr group_by, summarise, across and multiple function outputs?

Question

I have the following tibble:

tTest = tibble(Cells = rep(c("C1", "C2", "C3"), times = 3), 
               Gene = rep(c("G1", "G2", "G3"), each = 3), 
               Experiment_score = 1:9, 
               Pattern1 = 1:9, 
               Pattern2 = -(1:9), 
               Pattern3 = 9:1) %>%
        group_by(Gene)

and I would like to correlate the Experiment_score with each of the Pattern columns for all Gene.

Looking at the tidyverse across page and examples, I thought this would work:

# `corList` is a simple wrapper for `cor` to have exactly two outputs:
corList = function(x, y) {
    result = cor.test(x, y)
    return(list(stat = result$estimate, pval = result$p.value))
}

tTest %>% summarise(across(starts_with("Pattern"), ~ corList(Experiment_score, .x), .names = "{.col}_corr_{.fn}"))

but I got this:

I have found a solution by melting the Pattern columns and I will post it down below for completeness but the challenge is that I have dozens of Pattern columns and millions of rows. If I melt the Pattern columns, I end up with half a billion rows, seriously hampering my ability to work with the data.

EDIT: My own imperfect solution:

# `corVect` is a simple wrapper for `cor` to have exactly two outputs:
corVect = function(x, y) {
    result = cor.test(x, y)
    return(c(stat = result$estimate, pval = result$p.value))
}

tTest %>% pivot_longer(starts_with("Pattern"), names_to = "Pattern", values_to = "Strength") %>%
      group_by(Gene, Pattern) %>%
      summarise(CorrVal = corVect(Experiment_score, Strength)) %>% 
      mutate(CorrType = c("corr", "corr_pval")) %>%
      # Reformat
      pivot_wider(id_cols = c(Gene, Pattern), names_from = CorrType, values_from = CorrVal)

Andy Baxter · Accepted Answer · 2022-11-25T10:56:46.793

1

To get the desired result in one step, wrap the function return as a tibble rather than a list, and call .unpack = TRUE in across. Here using a conveniently-named corTibble function:

library(tidyverse)

tTest = tibble(
  Cells = rep(c("C1", "C2", "C3"), times = 3),
  Gene = rep(c("G1", "G2", "G3"), each = 3),
  Experiment_score = 1:9,
  Pattern1 = 1:9 + rnorm(9),  # added some noise
  Pattern2 = -(1:9 + rnorm(9)),
  Pattern3 = 9:1 + rnorm(9)
) %>%
  group_by(Gene)

corTibble = function(x, y) {
  result = cor.test(x, y)
  return(tibble(stat = result$estimate, pval = result$p.value))
}

tTest %>% summarise(across(
  starts_with("Pattern"),
  ~ corTibble(Experiment_score, .x),
  .names = "{.col}_corr",
  .unpack = TRUE
))

#> # A tibble: 3 × 7
#>   Gene  Pattern1_corr_stat Pattern1_corr_pval Pattern2…¹ Patte…² Patte…³ Patte…⁴
#>   <chr>              <dbl>              <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
#> 1 G1                 0.947             0.208      -0.991  0.0866  -1.00   0.0187
#> 2 G2                 0.964             0.172      -0.872  0.325   -0.981  0.126 
#> 3 G3                 0.995             0.0668     -0.680  0.524   -0.409  0.732 
#> # … with abbreviated variable names ¹Pattern2_corr_stat, ²Pattern2_corr_pval,
#> #   ³Pattern3_corr_stat, ⁴Pattern3_corr_pval

edited Nov 25 '22 at 10:56

answered Nov 24 '22 at 12:13

Andy Baxter

5,833
1
8
22

1

Thanks, Andy! As I want to avoid the double calculation step because I would like to also use it for more complicated functions, I will go with the second, more messy solution. What I really don't understand is how `list(x=fun(.x)[[1]], y=fun(.)[[2]]` returns something different than `fun(.x)` returning a named list? Do you have any ideas why? – Martingales Nov 25 '22 at 09:54
1

Y'know what, in trying to figure out what the steps in the outputting of data were to see what was happening, I came across the `.unpack` option which will do exactly what you need it to when the function output is a dataframe/tibble. Have amended answer above. As to why not returning named list in first instance, I think it's because `across` is expecting a single column per function run and is processing it as such. If you look at the `across` code it seems to be structuring output as 'number of fuctions'*'number of data columns', even if function returns multiple values. – Andy Baxter Nov 25 '22 at 10:55
Mh, this is interesting. When using the `.unpack` parameter, then the results at my end look different to the example. This is what my header looks like: `Gene Pattern1_corr$stat $pval Pattern2_corr$stat $pval Pattern3_corr$stat $pval`. (tidyversye version 1.3.1 and R 4.1.3) For my larger data sets, I also observe that the `.unpack` parameter is slower than the "messy" solution we had before. – Martingales Nov 25 '22 at 12:41
1

Ah I think `.unpack` is a recent addition to `dplyr` [introduced in August this year](https://github.com/tidyverse/dplyr/pull/6429), I'm running dplyr v1.0.99.9000. You can try updating but if old solution still works then that's at least a smooth way forward! Would it be helpful to edit answer to include both? – Andy Baxter Nov 25 '22 at 13:13

How to combine dplyr group_by, summarise, across and multiple function outputs?

1 Answers1