
I have a large data frame in which every row contains enough data to calculate a correlation from specific columns, and I would like to add a new column containing the calculated correlations.

Here is a summary of what I would like to do (this one using dplyr):

example_data %>%
  mutate(pearsoncor = cor(x = X001_F5_000_A:X030_F5_480_C, y = X031_H5_000_A:X060_H5_480_C))

Obviously it does not work this way, as I get only NAs in the pearsoncor column. Does anyone have a suggestion? Is there an easy way to do this?

Best,

Example data frame

Yvan
  • It wouldn't work because you are not correctly using it. Try `diag(cor(t(example_data[columnnames]), t(example_data[columnnames])))` Or with `purrr` `map2_dbl(as.data.frame(t(example_data[columnnames])), as.data.frame(t(example_data[columnnames])), cor)` – akrun Dec 31 '17 at 10:05
  • I suggest you review your question and try to get an answer on https://stats.stackexchange.com. I think this is more a statistical problem than a coding one. – Scipione Sarlo Dec 31 '17 at 11:32
  • Example_data is no longer available at the link, which makes this question not very helpful – Ira S Mar 07 '23 at 21:59
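To make akrun's suggestion from the comments above concrete, here is a minimal sketch of the diag(cor(t(...), t(...))) idea. The column positions 4:33 and 34:63 are an assumption borrowed from qdread's answer below; adjust them to your data.

# Assumed positions of the two column blocks; replace with your actual columns
columnnames_x <- names(example_data)[4:33]
columnnames_y <- names(example_data)[34:63]

# cor() on the transposed blocks compares rows instead of columns;
# the diagonal of the resulting matrix is the row-wise Pearson correlation
example_data$pearsoncor <- diag(cor(t(example_data[columnnames_x]),
                                    t(example_data[columnnames_y])))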

3 Answers


With tidyr, you can gather all the x- and y-variables you'd like to compare into long form. You get a tibble containing the correlation coefficient and its p-value for every combination you provided.

library(dplyr)
library(tidyr)

example_data %>%
  gather(x_var, x_val, X001_F5_000_A:X030_F5_480_C) %>% 
  gather(y_var, y_val, X031_H5_000_A:X060_H5_480_C) %>% 
  group_by(x_var, y_var) %>% 
  summarise(cor_coef = cor.test(x_val, y_val)$estimate,
            p_val = cor.test(x_val, y_val)$p.value)

Edit, an update some years later:

library(tidyr)
library(purrr)
library(broom)
library(dplyr)

longley %>%
  pivot_longer(GNP.deflator:Armed.Forces, names_to="x_var", values_to="x_val") %>% 
  pivot_longer(Population:Employed, names_to="y_var", values_to="y_val") %>% 
  nest(data=c(x_val, y_val)) %>%
  mutate(cor_test = map(data, ~cor.test(.x$x_val, .x$y_val)),
         tidied = map(cor_test, tidy)) %>% 
  unnest(tidied)
MarkusN

Here is a solution using the reshape2 package to melt() the data frame into long form so that each value has its own row. The original wide-form data has 60 values per row for each of the 6 genes, while the melted long-form data frame has 360 rows, one for each value. Then we can easily use summarize() from dplyr to calculate the correlations without loops.

library(reshape2)
library(dplyr)

names1 <- names(example_data)[4:33]
names2 <- names(example_data)[34:63]

example_data_longform <- melt(example_data, id.vars = c('Gene','clusterFR','clusterHR'))

example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  summarize(pearsoncor = cor(x = value[variable %in% names1],
                             y = value[variable %in% names2]))

You could also generate more detailed results, as in Eudald's answer, using do():

detailed_r <- example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  do(cor = cor.test(x = .$value[.$variable %in% names1],
                    y = .$value[.$variable %in% names2]))

This outputs a tibble with the cor column being a list with the results of cor.test() for each gene. We can use lapply() to extract output from the list.

lapply(detailed_r$cor, function(x) c(x$estimate, x$p.value))
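
If a regular data frame is more convenient than a list, a small sketch (reusing the same detailed_r object from above) binds the extracted values back to the grouping columns:

# Turn each cor.test() result into a one-row data frame, then stack them
cor_table <- do.call(rbind, lapply(detailed_r$cor, function(x)
  data.frame(estimate = unname(x$estimate), p_value = x$p.value)))

# Re-attach the grouping columns from detailed_r
cbind(detailed_r[, c("Gene", "clusterFR", "clusterHR")], cor_table)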
qdread

I had the same problem a few days back, and I know loops are not optimal in R, but that's the only thing I could think of:

# Pre-allocate the result columns
df$r <- rep(0, nrow(df))
df$cor_p <- rep(0, nrow(df))

# cols_A and cols_B are the two sets of columns to correlate, row by row
for (i in 1:nrow(df)) {
  ct <- cor.test(as.numeric(df[i, cols_A]), as.numeric(df[i, cols_B]))
  df$r[i] <- ct$estimate
  df$cor_p[i] <- ct$p.value
}
Eudald
  • Many thanks Eudald, I used a similar loop as a workaround while searching for an efficient solution. With my dataset the loop takes about 5 minutes to complete :-/ – Yvan Dec 31 '17 at 12:18
  • Perfect amount of time to grab a cup of coffee ;-) (I'll try to think of something more efficient!) – Eudald Dec 31 '17 at 12:28
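
For reference, a vectorized sketch of the same row-wise Pearson correlation, which avoids the per-row loop. It assumes cols_A and cols_B are the same column sets used in the loop above and that all of those columns are numeric:

# Centre each row of both blocks, then apply the Pearson formula row by row
A <- as.matrix(df[, cols_A])
B <- as.matrix(df[, cols_B])
A_c <- A - rowMeans(A)
B_c <- B - rowMeans(B)

# Same values as df$r from the loop, computed without iterating over rows
df$r_vec <- rowSums(A_c * B_c) / sqrt(rowSums(A_c^2) * rowSums(B_c^2))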