
I have a table like this:

A   B   C   D   E
7   1   6   8   7
9   3   9   5   9
4   6   2   1   10
10  5   3   4   1
1   3   5   9   3
6   4   8   7   6

I am in the process of finding the correlation of each variable with every other variable in the table. This is the R code I use:

test <- read.csv("D:/AB/test.csv")
iterations <- ncol(test)
correlation <- matrix(ncol = 3 , nrow = iterations * iterations)
for (k in 1:iterations) {
  for (l in 1:iterations){
    corr <- cor(test[,k], test[,l])
    corr_string_A <- names(test[k])
    corr_string_B <- names(test[l])
    correlation[l + ((k-1) * iterations),] <- rbind(corr_string_A, corr_string_B, corr)        
  }
}

The following is the output that I received:

   Var1 Var2       value
1     A    A  1.00000000
2     B    A  0.50018605
3     C    A -0.35747393
4     D    A -0.25670054
5     E    A -0.02974821
6     A    B  0.50018605
7     B    B  1.00000000
8     C    B  0.56070716
9     D    B  0.46164928
10    E    B  0.16813991
11    A    C -0.35747393
12    B    C  0.56070716
13    C    C  1.00000000
14    D    C  0.52094589
15    E    C  0.23190036
16    A    D -0.25670054
17    B    D  0.46164928
18    C    D  0.52094589
19    D    D  1.00000000
20    E    D -0.39223227
21    A    E -0.02974821
22    B    E  0.16813991
23    C    E  0.23190036
24    D    E -0.39223227
25    E    E  1.00000000

However, I don't want the values from the upper triangle or the diagonal; each unique combination should appear only once. The final output should look like:

   Var1 Var2       value
1     B    A  0.50018605
2     C    A -0.35747393
3     D    A -0.25670054
4     E    A -0.02974821
5     C    B  0.56070716
6     D    B  0.46164928
7     E    B  0.16813991
8     D    C  0.52094589
9     E    C  0.23190036
10    E    D -0.39223227

I understand that there are techniques like reshape that can achieve this output, but I want to modify the above R code itself so that it produces the results shown.

I believe the upper bound of the inner for loop should change dynamically to achieve this, but I am not sure how to make that work.

  • Can't you just do `cor(test)`? From your output, I guess, `correlation[with(correlation, Var1 != Var2),]` – akrun Dec 10 '15 at 18:27
  • Second example in `?cor`... – Henrik Dec 10 '15 at 18:31
  • @akrun - Thank you. But due to some specific work purpose, the above R code will suit me better than `cor(test)` or the reshape function, etc. – Arun Dec 10 '15 at 18:35
  • I fail to understand your reason for not adopting reshape or the answer provided by josliber – Bg1850 Dec 10 '15 at 18:41
  • @Bg1850 - My dataset at my assignment contains many columns with zeroes. I can't calculate the correlation for the columns with zeroes. For those columns I need to implement separate logic which, based on certain conditions, will place a zero or one (this is based on the business requirement). So this code will be helpful. Also, the code was already working. But the big problem that I face is the diagonal values and the repetition of pairs (for example, AB & BA - I just need one). – Arun Dec 10 '15 at 18:46
  • Regarding "the big problem that I face is the diagonal values and repetition of the pairs": `m <- cor(test)`; `m[lower.tri(m)]`. `lower.tri` is _also_ mentioned in `?cor`. – Henrik Dec 10 '15 at 19:02

2 Answers


You can convert your correlation matrix to the 3-column format with as.data.frame and as.table, and then limiting to values above or below the diagonal can be done with subset.

subset(as.data.frame(as.table(cor(dat))),
       match(Var1, names(dat)) > match(Var2, names(dat)))
#    Var1 Var2        Freq
# 2     B    A -0.02299154
# 3     C    A  0.23155350
# 4     D    A -0.28036851
# 5     E    A -0.05230260
# 8     C    B -0.58384036
# 9     D    B -0.80175393
# 10    E    B  0.00000000
# 14    D    C  0.52094589
# 15    E    C  0.23190036
# 20    E    D -0.39223227

Note that for larger datasets this should be much more efficient than calling cor separately on each pair of variables, because cor is vectorized; it's also a lot less typing.
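Equivalently, the `lower.tri` mask mentioned in the comments can build the same three-column result without `subset`. A sketch, using the six-row table from the question as a stand-in for `test.csv` (the `res` name is just illustrative):

```r
# the six-row table from the question, as a stand-in for test.csv
dat <- data.frame(A = c(7, 9, 4, 10, 1, 6),
                  B = c(1, 3, 6, 5, 3, 4),
                  C = c(6, 9, 2, 3, 5, 8),
                  D = c(8, 5, 1, 4, 9, 7),
                  E = c(7, 9, 10, 1, 3, 6))

m <- cor(dat)
# arr.ind = TRUE returns the (row, col) positions of lower-triangle cells
idx <- which(lower.tri(m), arr.ind = TRUE)
res <- data.frame(Var1  = rownames(m)[idx[, "row"]],
                  Var2  = colnames(m)[idx[, "col"]],
                  value = m[idx])
res  # 10 rows: one per unordered pair, no diagonal
```

Because `lower.tri` keeps cells with row index greater than column index, `Var1` always comes later alphabetically than `Var2`, matching the ordering in the desired output.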

If you really must keep the looping code, then you can achieve your desired result with small changes to the pair of for loops and some bookkeeping about the row of correlation that you are computing:

iterations <- ncol(test)
# one row per unordered pair of columns: choose(iterations, 2)
correlation <- matrix(ncol = 3 , nrow = choose(iterations, 2))
pos <- 1  # next row of the result matrix to fill
for (k in 2:iterations) {
  for (l in 1:(k-1)){  # only columns before k: no diagonal, no repeated pairs
    corr <- cor(test[,k], test[,l])
    corr_string_A <- names(test[k])
    corr_string_B <- names(test[l])
    correlation[pos,] <- rbind(corr_string_A, corr_string_B, corr)
    pos <- pos + 1
  }
}

However, I really wouldn't suggest this looping solution; it would be better to use the one-liner above and handle any generated NA values afterward.
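As an aside on the NA handling, `cor` itself can cope with missing values through its `use` argument. A minimal sketch, assuming the problematic cells have first been recoded to `NA` (the data frame here is made up):

```r
# made-up data with one missing value in y
df <- data.frame(x = c(1, 2, 3, 4, 5),
                 y = c(2, 4, NA, 8, 10),
                 z = c(5, 3, 4, 1, 2))

# "pairwise.complete.obs" drops NA rows per pair of columns
# instead of producing NA for the whole pair
m <- cor(df, use = "pairwise.complete.obs")
m["x", "y"]  # computed from the 4 complete (x, y) pairs
```

Whether this matches the business rule depends on how the zero-filled columns are recoded; it is only one of the options `?cor` documents for `use`.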

josliber

From the OP's loop output (stored as the data frame `df1`), we can subset the rows:

df1[!duplicated(t(apply(df1[1:2], 1, sort))) & df1[,1]!=df1[,2],]
#   Var1 Var2       value
#2     B    A  0.50018605
#3     C    A -0.35747393
#4     D    A -0.25670054
#5     E    A -0.02974821
#8     C    B  0.56070716
#9     D    B  0.46164928
#10    E    B  0.16813991
#14    D    C  0.52094589
#15    E    C  0.23190036
#20    E    D -0.39223227
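To see why the filter works, here is a tiny self-contained sketch with a made-up three-variable `df1`: sorting each `Var1`/`Var2` pair makes AB and BA identical, so `duplicated` flags the mirror, and the second condition drops the diagonal.

```r
# a tiny stand-in for df1: all 9 ordered pairs of three variables
df1 <- expand.grid(Var1 = c("A", "B", "C"), Var2 = c("A", "B", "C"),
                   stringsAsFactors = FALSE)
df1$value <- seq_len(nrow(df1))  # placeholder values

# keep one row per unordered pair, dropping the diagonal
res <- df1[!duplicated(t(apply(df1[1:2], 1, sort))) & df1[,1] != df1[,2], ]
res[, 1:2]  # choose(3, 2) = 3 pairs remain
```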

Or as I mentioned (first) in the comments, just use

cor(test)
akrun