
Hi, it seems that Spearman correlation should produce the same result regardless of whether the data are z-scored or raw. Here are two examples.

https://stats.stackexchange.com/questions/77562/why-does-correlation-come-out-the-same-on-raw-data-and-z-scored-standardized-d

https://stats.stackexchange.com/questions/13952/can-spearmans-correlation-be-run-on-z-scores

However, for the example below the two correlations are different, and I'm wondering what is going on.

df = read.csv("https://www.dropbox.com/s/jdktw9jugzm97v3/test.csv?dl=1", header = FALSE)

cor(df[, 1], df[,2], method="spearman")
cor(scale(df[, 1]), scale(df[,2]), method="spearman")

# 0.8462699 vs 0.8905341

Interestingly, Pearson gives the same result for both. What am I doing or thinking incorrectly here?

Edit: I thought this might be due to ties, so I also tried Kendall, which should handle ties. However, it also gives different results.

cor(as.matrix(df[, 1]), as.matrix(df[, 2]), method = "kendall")
cor(scale(as.matrix(df[, 1])), scale(as.matrix(df[, 2])), method = "kendall")

Thanks.

Ahdee
  • I'm not sure what's going on here but I noticed that if you add a small constant to both columns (I tried adding 1, 100, -1, and adding 1 to one column and subtracting 1 from the other), the correlation is .9157, regardless of whether you scale it or not. So I wonder if this has something to do with numerical instability; both columns have entries which are extremely close to 0 and those might be throwing things off. Spearman's correlation certainly ought to be scale invariant, since rescaling won't change the ranks. – Joseph Clark McIntyre Jan 21 '19 at 01:58
  • @JosephClarkMcIntyre Yes, so weird. I also used cor.test with ties, ```cor.test(df[, 1], df[, 2], method = "spearman", exact = FALSE); cor.test(scale(df[, 1]), scale(df[, 2]), method = "spearman", exact = FALSE)``` and the results are still different. – Ahdee Jan 21 '19 at 02:12
  • For sure this is a rounding error. You have data points orders of magnitude smaller than `.Machine$double.eps` and over 20 orders of magnitude of range in the data. You can reproduce it with fake data like this: `df = data.frame( x = (rnorm(20,10,2) + (1:20)/2)*10^(-18:1), y = rnorm(20,20,3) + (1:20)/3 )` – dww Jan 21 '19 at 03:09
  • @dww Thanks, you are right. When I rounded to 15 digits the results were the same. – Ahdee Jan 21 '19 at 16:25
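
To make the rounding explanation in the comments concrete, here is a minimal sketch on made-up numbers (the vectors `x` and `y` are hypothetical, not the values from test.csv). It assumes the mechanism is the one dww describes: `scale()` subtracts the column mean, and values far smaller than `.Machine$double.eps` relative to that mean all collapse to the same double, which creates ties that were not in the raw data.

```r
# Hypothetical toy data: three values far below .Machine$double.eps
# sitting next to values of order 1.
x <- c(1e-20, 2e-20, 3e-20, 1, 2)
y <- c(2, 1, 3, 4, 5)

# scale() subtracts mean(x), roughly 0.6. The tiny values are smaller than
# the spacing between doubles near 0.6, so all three become the same number.
xs <- as.vector(scale(x))
ys <- as.vector(scale(y))

rank(x)   # ranks of the raw values: no ties
rank(xs)  # ranks after scaling: the three tiny values are now tied

# Spearman works on ranks, so the new ties change the result.
rho_raw    <- cor(x, y, method = "spearman")
rho_scaled <- cor(xs, ys, method = "spearman")
c(raw = rho_raw, scaled = rho_scaled)
```

Pearson is much less sensitive to this because it uses the values themselves, and a shift of 1e-20 in a few points barely moves it, whereas a rank-based method only sees the order, which the centering has changed.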

1 Answer


Hi, as mentioned in the comments above, this was due to a rounding error. No one posted an answer, so I wanted to add this in case someone else stumbles on a similar issue. When I round to 15-16 digits, the results are the same.

df = read.csv("https://www.dropbox.com/s/jdktw9jugzm97v3/test.csv?dl=1", header = FALSE)

df = round(df, digits = 15)

cor(as.matrix(df[, 1]), as.matrix(df[, 2]), method = "spearman")
cor(scale(df[, 1]), scale(df[, 2]), method = "spearman")
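
For anyone who cannot download the file, the effect of rounding can be sketched on made-up data. The vectors and the helper `spearman_raw_vs_scaled` below are hypothetical names for illustration only. The sketch assumes the disagreement comes from near-zero values that collapse into ties when `scale()` centers the column. Rounding turns those values into ties in the raw data as well, so both versions see the same ranks.

```r
# Hypothetical toy data, not the values from test.csv.
x <- c(1e-20, 2e-20, 3e-20, 1, 2)
y <- c(2, 1, 3, 4, 5)

# Illustrative helper: Spearman correlation on raw and on z-scored input.
spearman_raw_vs_scaled <- function(x, y) {
  c(raw    = cor(x, y, method = "spearman"),
    scaled = cor(as.vector(scale(x)), as.vector(scale(y)), method = "spearman"))
}

# Without rounding, the raw and scaled results disagree.
before <- spearman_raw_vs_scaled(x, y)

# After rounding to 15 digits the tiny values are all exactly 0,
# so raw and scaled data produce the same ranks.
after <- spearman_raw_vs_scaled(round(x, 15), round(y, 15))

before
after
```

Note that rounding makes the two numbers agree by introducing ties into the raw data, so the agreed value need not equal the original raw result. Whether that is acceptable depends on whether differences at that scale mean anything in your data.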

Thanks, everyone, for helping with this.

Ahdee