1

I have the following tibble:

tTest = tibble(Cells = rep(c("C1", "C2", "C3"), times = 3), 
               Gene = rep(c("G1", "G2", "G3"), each = 3), 
               Experiment_score = 1:9, 
               Pattern1 = 1:9, 
               Pattern2 = -(1:9), 
               Pattern3 = 9:1) %>%
        group_by(Gene)

and I would like to correlate the Experiment_score with each of the Pattern columns for all Gene.

Looking at the tidyverse across page and examples, I thought this would work:

# `corList` is a simple wrapper for `cor` to have exactly two outputs:
corList = function(x, y) {
    result = cor.test(x, y)
    return(list(stat = result$estimate, pval = result$p.value))
}

tTest %>% summarise(across(starts_with("Pattern"), ~ corList(Experiment_score, .x), .names = "{.col}_corr_{.fn}"))

but I got this: enter image description here

I have found a solution by melting the Pattern columns and I will post it down below for completeness but the challenge is that I have dozens of Pattern columns and millions of rows. If I melt the Pattern columns, I end up with half a billion rows, seriously hampering my ability to work with the data.

EDIT: My own imperfect solution:

# `corVect` is a simple wrapper for `cor` to have exactly two outputs:
corVect = function(x, y) {
    result = cor.test(x, y)
    return(c(stat = result$estimate, pval = result$p.value))
}

tTest %>% pivot_longer(starts_with("Pattern"), names_to = "Pattern", values_to = "Strength") %>%
      group_by(Gene, Pattern) %>%
      summarise(CorrVal = corVect(Experiment_score, Strength)) %>% 
      mutate(CorrType = c("corr", "corr_pval")) %>%
      # Reformat
      pivot_wider(id_cols = c(Gene, Pattern), names_from = CorrType, values_from = CorrVal)
Martingales
  • 169
  • 1
  • 8

1 Answers1

1

To get the desired result in one step, wrap the function return as a tibble rather than a list, and call .unpack = TRUE in across. Here using a conveniently-named corTibble function:

library(tidyverse)

tTest = tibble(
  Cells = rep(c("C1", "C2", "C3"), times = 3),
  Gene = rep(c("G1", "G2", "G3"), each = 3),
  Experiment_score = 1:9,
  Pattern1 = 1:9 + rnorm(9),  # added some noise
  Pattern2 = -(1:9 + rnorm(9)),
  Pattern3 = 9:1 + rnorm(9)
) %>%
  group_by(Gene)

corTibble = function(x, y) {
  result = cor.test(x, y)
  return(tibble(stat = result$estimate, pval = result$p.value))
}

tTest %>% summarise(across(
  starts_with("Pattern"),
  ~ corTibble(Experiment_score, .x),
  .names = "{.col}_corr",
  .unpack = TRUE
))

#> # A tibble: 3 × 7
#>   Gene  Pattern1_corr_stat Pattern1_corr_pval Pattern2…¹ Patte…² Patte…³ Patte…⁴
#>   <chr>              <dbl>              <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
#> 1 G1                 0.947             0.208      -0.991  0.0866  -1.00   0.0187
#> 2 G2                 0.964             0.172      -0.872  0.325   -0.981  0.126 
#> 3 G3                 0.995             0.0668     -0.680  0.524   -0.409  0.732 
#> # … with abbreviated variable names ¹​Pattern2_corr_stat, ²​Pattern2_corr_pval,
#> #   ³​Pattern3_corr_stat, ⁴​Pattern3_corr_pval
Andy Baxter
  • 5,833
  • 1
  • 8
  • 22
  • 1
    Thanks, Andy! As I want to avoid the double calculation step because I would like to also use it for more complicated functions, I will go with the second, more messy solution. What I really don't understand is how `list(x=fun(.x)[[1]], y=fun(.)[[2]]` returns something different than `fun(.x)` returning a named list? Do you have any ideas why? – Martingales Nov 25 '22 at 09:54
  • 1
    Y'know what, in trying to figure out what the steps in the outputting of data were to see what was happening, I came across the `.unpack` option which will do exactly what you need it to when the function output is a dataframe/tibble. Have amended answer above. As to why not returning named list in first instance, I think it's because `across` is expecting a single column per function run and is processing it as such. If you look at the `across` code it seems to be structuring output as 'number of fuctions'*'number of data columns', even if function returns multiple values. – Andy Baxter Nov 25 '22 at 10:55
  • Mh, this is interesting. When using the `.unpack` parameter, then the results at my end look different to the example. This is what my header looks like: `Gene Pattern1_corr$stat $pval Pattern2_corr$stat $pval Pattern3_corr$stat $pval`. (tidyversye version 1.3.1 and R 4.1.3) For my larger data sets, I also observe that the `.unpack` parameter is slower than the "messy" solution we had before. – Martingales Nov 25 '22 at 12:41
  • 1
    Ah I think `.unpack` is a recent addition to `dplyr` [introduced in August this year](https://github.com/tidyverse/dplyr/pull/6429), I'm running dplyr v1.0.99.9000. You can try updating but if old solution still works then that's at least a smooth way forward! Would it be helpful to edit answer to include both? – Andy Baxter Nov 25 '22 at 13:13