I am attempting to calculate the correlation between all the rows of a large data frame, and so far have come up with a simple for-loop that works. For example:

name <- c("a", "b", "c", "d")
col1 <- c(43.78, 43.84, 37.92, 31.72)
col2 <- c(43.80, 43.40, 37.64, 31.62)
col3 <- c(43.14, 42.85, 37.54, 31.74)
df <- data.frame(name, col1, col2, col3)
cor.df <- data.frame(name1=NA, name2=NA,correl=NA)

for(i in 1: (nrow(df) - 1))  {
  for(j in (i+1): nrow(df) ) {
    v1 <- as.numeric( df[i, 2:ncol(df)] )
    v2 <- as.numeric( df[j, 2:ncol(df)] )
    correl <- cor(v1, v2)

    name1 <- df[i, "name"]
    name2 <- df[j, "name"]

    dftemp <- data.frame(name1, name2, correl)
    cor.df <- rbind(cor.df, dftemp)
   }
}

na.omit(cor.df)

#    name1 name2     correl
#     a     b      0.8841255
#     a     c      0.6842705
#     a     d     -0.6491118
#     b     c      0.9457125
#     b     d     -0.2184630
#     c     d      0.1105508

Because the data frame is large and the nested for-loop grows the result with rbind on every iteration, the correlation computation takes a long time. Does anyone have suggestions for making it faster? Note that I have many data frames in a list, so I could use lapply, but I have not figured out how to write that line of code.

Sotos
fragf
  • Use a matrix, not a data frame. Data frames are built to work with columns. Any time you are treating rows of a data frame as vectors, you should either transpose the data frame so that they become columns or convert it to a matrix. – Gregor Thomas Oct 30 '17 at 13:57

1 Answer

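Drop the first column, transpose, and use the cor function (stats::cor, which ships with base R). It computes correlations between columns, so transposing turns the rows into the vectors being compared: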

> cor(t(df[-1]))
           [,1]       [,2]      [,3]       [,4]
[1,]  1.0000000  0.8841255 0.6842705 -0.6491118
[2,]  0.8841255  1.0000000 0.9457125 -0.2184630
[3,]  0.6842705  0.9457125 1.0000000  0.1105508
[4,] -0.6491118 -0.2184630 0.1105508  1.0000000
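
Since the question mentions a list of data frames, the same one-liner can be wrapped in a function and passed to lapply. This is only a minimal sketch: `df_list` and the helper name `row_cor` are made up for illustration, and it assumes every data frame has the name column first and only numeric columns after it.

```r
# Example data from the question
df <- data.frame(
  name = c("a", "b", "c", "d"),
  col1 = c(43.78, 43.84, 37.92, 31.72),
  col2 = c(43.80, 43.40, 37.64, 31.62),
  col3 = c(43.14, 42.85, 37.54, 31.74)
)

# Hypothetical list of data frames (here just the same one twice)
df_list <- list(first = df, second = df)

# Hypothetical helper: correlation between all rows of one data frame
row_cor <- function(d) {
  m <- as.matrix(d[, -1])              # numeric columns as a matrix
  rownames(m) <- as.character(d$name)  # label rows so the result is labelled
  cor(t(m))                            # cor() works on columns, so transpose
}

# One correlation matrix per data frame
cor_list <- lapply(df_list, row_cor)

cor_list$first["a", "b"]
```

Each element of `cor_list` is a square matrix with the row names on both dimensions, so individual pairs can be looked up by name.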

# pretty output
x <- cor(t(df[, -1]))
x[upper.tri(x, diag = TRUE)] <- NA
rownames(x) <- colnames(x) <- df$name
x <- na.omit(reshape::melt(t(x)))
x <- x[ order(x$X1, x$X2), ]

x
#    X1 X2      value
# 5   a  b  0.8841255
# 9   a  c  0.6842705
# 13  a  d -0.6491118
# 10  b  c  0.9457125
# 14  b  d -0.2184630
# 15  c  d  0.1105508
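
If the reshape package is not available, a similar long table can be built with base R alone. The following is a sketch rather than a drop-in replacement; the name `pair_df` is made up here, and the column names name1, name2 and correl are borrowed from the question.

```r
# Example data from the question
df <- data.frame(
  name = c("a", "b", "c", "d"),
  col1 = c(43.78, 43.84, 37.92, 31.72),
  col2 = c(43.80, 43.40, 37.64, 31.62),
  col3 = c(43.14, 42.85, 37.54, 31.74)
)

# Correlation between all rows, labelled by name
x <- cor(t(as.matrix(df[, -1])))
dimnames(x) <- list(as.character(df$name), as.character(df$name))

# Row/column indices of the upper triangle, i.e. each pair exactly once
idx <- which(upper.tri(x), arr.ind = TRUE)

# Long format: one row per pair (pair_df is a hypothetical name)
pair_df <- data.frame(
  name1  = rownames(x)[idx[, "row"]],
  name2  = colnames(x)[idx[, "col"]],
  correl = x[idx]
)
pair_df <- pair_df[order(pair_df$name1, pair_df$name2), ]

pair_df
```

Indexing `x` with the two-column matrix `idx` picks out one value per (row, col) pair, which avoids the NA-and-melt round trip.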
zx8754
amrrs
  • Thanks @zx8754, that also teaches me how to write better answers. – amrrs Oct 30 '17 at 14:14
  • Thank you, the calculations are now very quick. – fragf Oct 30 '17 at 15:12
  • After you obtain the correlation of each pair of variables, how could you then identify all groups of variables such that all variables within a group have correlation with every other variable in the same group of at least some number (minimum group size 2)? Something like Group A contains seven variables that all have correlation at least 0.8 with each other and Group B contains four variables with correlation at least 0.8 with each other. – Dario Mar 05 '21 at 05:00