How to create tidy correlations using nested dataframes?

Question

This question has been partially answered previously (e.g., here), but – as far as I can tell – there is no full answer using a reproducible example. I would like to select variables by name from a nested data frame, calculate pairwise correlations, and then add the correlation coefficients and p-values to the unnested data frame with appropriately names columns. The following example yields the desired outcome:

library(tidyverse)
library(broom)

df <- mtcars %>% 
  nest(data = everything()) %>% 
  mutate(cor_test = map(data, ~ cor.test(.x$mpg, .x$disp)),
         tidied = map(cor_test, tidy)) %>% 
  unnest(tidied) %>% 
  select(-c(cor_test, statistic, parameter, conf.low, conf.high, method, alternative)) %>% 
  rename(c(mpg_disp_estimate = estimate, mpg_disp_p.value = p.value)) %>% 
  mutate(cor_test = map(data, ~ cor.test(.x$mpg, .x$cyl)),
         tidied = map(cor_test, tidy)) %>% 
  unnest(tidied) %>% 
  select(-c(cor_test, statistic, parameter, conf.low, conf.high, method, alternative)) %>% 
  rename(c(mpg_cyl_estimate = estimate, mpg_cyl_p.value = p.value)) %>% 
  mutate(cor_test = map(data, ~ cor.test(.x$disp, .x$cyl)),
         tidied = map(cor_test, tidy)) %>% 
  unnest(tidied) %>% 
  select(-c(cor_test, statistic, parameter, conf.low, conf.high, method, alternative)) %>% 
  rename(c(disp_cyl_estimate = estimate, disp_cyl_p.value = p.value))

Obviously, this is not a good solution, since it involves repeating the same code over and over again. Is there a way to accomplish this goal more elegantly with purrr and broom?

akrun · Accepted Answer · 2021-04-21T21:34:19.127

We could do this with combn. Get a pairwise combination of column names of data with combn, extract the column values from the data, apply cor.test, return the tidyied output, create a 'categ' column to identify the columns used in the test, and bind the list of tibble output to a single data.frame

library(dplyr)
library(broom)
library(stringr)
out <- combn(names(mtcars), 2, FUN = function(x)  
           tidy(cor.test(mtcars[[x[1]]], mtcars[[x[2]]])) %>% 
       mutate(categ = str_c(x, collapse="_"), .before = 1), 
          simplify = FALSE) %>%
    bind_rows

-output

out
# A tibble: 55 x 9
#   categ    estimate statistic  p.value parameter conf.low conf.high method                               alternative
#   <chr>       <dbl>     <dbl>    <dbl>     <int>    <dbl>     <dbl> <chr>                                <chr>      
# 1 mpg_cyl    -0.852     -8.92 6.11e-10        30  -0.926     -0.716 Pearson's product-moment correlation two.sided  
# 2 mpg_disp   -0.848     -8.75 9.38e-10        30  -0.923     -0.708 Pearson's product-moment correlation two.sided  
# 3 mpg_hp     -0.776     -6.74 1.79e- 7        30  -0.885     -0.586 Pearson's product-moment correlation two.sided  
# 4 mpg_drat    0.681      5.10 1.78e- 5        30   0.436      0.832 Pearson's product-moment correlation two.sided  
# 5 mpg_wt     -0.868     -9.56 1.29e-10        30  -0.934     -0.744 Pearson's product-moment correlation two.sided  
# 6 mpg_qsec    0.419      2.53 1.71e- 2        30   0.0820     0.670 Pearson's product-moment correlation two.sided  
# 7 mpg_vs      0.664      4.86 3.42e- 5        30   0.410      0.822 Pearson's product-moment correlation two.sided  
# 8 mpg_am      0.600      4.11 2.85e- 4        30   0.318      0.784 Pearson's product-moment correlation two.sided  
# 9 mpg_gear    0.480      3.00 5.40e- 3        30   0.158      0.710 Pearson's product-moment correlation two.sided  
#10 mpg_carb   -0.551     -3.62 1.08e- 3        30  -0.755     -0.250 Pearson's product-moment correlation two.sided  
# … with 45 more rows

If we want to create a wide format, use pivot_wider

library(tidyr)
out1 <- combn(names(mtcars), 2, FUN = function(x)  
       tidy(cor.test(mtcars[[x[1]]], mtcars[[x[2]]])) %>% 
   mutate(categ = str_c(x, collapse="_"), .before = 1), 
      simplify = FALSE) %>%
  bind_rows %>% 
  select(categ, estimate, p.value) %>%
  pivot_wider(names_from = categ, values_from = c(estimate, p.value))

If we want to use in a nested data, wrap the above code in a function and map over the list 'data' column

library(purrr)
f1 <- function(dat) {
     combn(names(dat), 2, FUN = function(x)  
      tidy(cor.test(dat[[x[1]]], dat[[x[2]]])) %>% 
       mutate(categ = str_c(x, collapse="_"), .before = 1), 
     simplify = FALSE) %>%
     bind_rows %>% 
     select(categ, estimate, p.value) %>%
     pivot_wider(names_from = categ, values_from = c(estimate, p.value))
   }

mtcars %>%
    nest(data = everything()) %>%
    mutate(out = map(data, f1))
# A tibble: 1 x 2
#  data               out               
#  <list>             <list>            
#1 <tibble [32 × 11]> <tibble [1 × 110]>

Thanks, as always, for your great suggestions. This is a helpful start, but -- as noted in the question -- I'm trying to find an approach that also (a) drops all of the returned values except estimate and p.value, and (b) creates column names combining the relevant variables (e.g., disp_cyl_estimate). Is there a way to do that? — nicholas, Apr 21 '21 at 02:57
One more question -- this solution works for mtcar, but my question pertains to a nested dataframe (my actual use case). How would you apply this solution to a nested dataframe? — nicholas, Apr 21 '21 at 21:14
@nicholas why do you need to use in a nested dataset. In the example you have only a single data in nested way. Can't you just `unnest` and use the same code — akrun, Apr 21 '21 at 21:28
@nicholas do you have multiple data in the nest or else extract the data `nest_mt$data[[1]]` and use the same code — akrun, Apr 21 '21 at 21:30
In my real use case I have both nested and unnested data frames. I want to analyze correlations within the nested dataframes and then store the results at the unnested level. I could unnest, use ```group_by``` to calculate the correlations, and then re-nest, but that seems inefficient. — nicholas, Apr 21 '21 at 21:32
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231434/discussion-between-nicholas-and-akrun). — nicholas, Apr 21 '21 at 21:34

How to create tidy correlations using nested dataframes?

1 Answers1