R: Converting Large Dataframe to Pairwise Correlation Matrix

Question

I have data of the form:

df <- data.frame(group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
                  thing = c(rep(c('a','b','c','d','e'),5)),
                  score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0))

which reports the "score" for each "thing" for a bunch of "group"s.

I would like to create the correlation matrix of all pairwise correlations between "thing"s, where each correlation is computed from the two things' scores across groups:

         thing_a thing_b thing_c thing_d thing_e
thing_a  1       .       .       .       .
thing_b  corr    1       .       .       .
thing_c  corr    corr    1       .       .
thing_d  corr    corr    corr    1       .
thing_e  corr    corr    corr    corr    1

For example, the data underlying the correlation between thing "a" and thing "b" would be:

group  thing_a_score  thing_b_score
1      1              1
2      1              1
3      1              1
4      0              1
5      0              1

In reality, the number of unique groups is ~1,000 and the number of things is ~10,000 so I need an approach that is more efficient than a brute force for-loop.

I don't need the resulting correlations to be in a single matrix, or even in a matrix per se (i.e., they could be in a bunch of data sets with three columns "thing_1 thing_2 corr").

For this particular example, this works as well: ```cor(table(df[df$score == 1, c('group', 'thing')]))``` — Cole, Oct 14 '19 at 00:20

score 2 · Accepted Answer · answered Oct 13 '19 at 22:09

You can dcast your data to wide format first (one row per group, one column per thing) and then use the cor() function to get the correlation matrix:

library(data.table)
dt <- data.table(
  group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
  thing = c(rep(c('a','b','c','d','e'),5)),
  score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0)
)
dt

m <- dcast(dt, group ~ thing, value.var = "score")

cor(m[, -1])
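
If the three-column "thing_1 thing_2 corr" layout mentioned in the question is preferred, the matrix can be flattened afterwards. The following is only a sketch on the sample data; the names `cm`, `idx` and `pairs` are made up for illustration. Note that in the sample data thing "b" scores 1 in every group, so its standard deviation is zero and cor() returns NA for it (with a warning, suppressed here).

```r
library(data.table)

# Same sample data as above
dt <- data.table(
  group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
  thing = c(rep(c('a','b','c','d','e'),5)),
  score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0)
)

# Wide format: one row per group, one column per thing
m <- dcast(dt, group ~ thing, value.var = "score")

# Correlation matrix; the warning about the constant column "b" is suppressed
cm <- suppressWarnings(cor(m[, -1]))

# Row/column positions of the lower triangle, so each pair appears once
idx <- which(lower.tri(cm), arr.ind = TRUE)

# Three-column long format (hypothetical column names from the question)
pairs <- data.table(
  thing_1 = rownames(cm)[idx[, "row"]],
  thing_2 = colnames(cm)[idx[, "col"]],
  corr    = cm[idx]
)
pairs
```

With 10,000 things the long table has roughly 50 million rows, so keeping the result as a matrix is likely the more compact choice unless the pair list is really needed.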

data.table is usually performant, but if it is not working for you, please write a reproducible example that generates a large amount of data; somebody might then benchmark the speed and memory use of different solutions.
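
A reproducible example of that kind could look roughly like the sketch below. The sizes, the seed, the random 0/1 scores and the names `n_group`, `n_thing`, `big`, `wide` and `cm_big` are all assumptions chosen for illustration; raise the sizes towards 1,000 groups and 10,000 things to approximate the real problem.

```r
library(data.table)

set.seed(42)
n_group <- 200   # assumed size; the real data has ~1,000 groups
n_thing <- 500   # assumed size; the real data has ~10,000 things

# One row per (group, thing) combination
big <- CJ(group = seq_len(n_group),
          thing = sprintf("t%04d", seq_len(n_thing)))

# Random 0/1 scores stand in for the real ones
big[, score := rbinom(.N, 1, 0.5)]

# Same two steps as above, timed
timing <- system.time({
  wide   <- dcast(big, group ~ thing, value.var = "score")
  cm_big <- cor(as.matrix(wide[, -1]))
})

dim(cm_big)   # n_thing x n_thing
timing
```

The resulting matrix holds n_thing^2 doubles, so at 10,000 things it needs about 800 MB of memory on its own.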

Works like a charm! And easily handled the 1,000 group by 10,000 thing sized matrix. Thank you! — km5041, Oct 13 '19 at 23:09
