33

Cor.test() takes vectors x and y as arguments, but I have an entire matrix of data that I want to test, pairwise. Cor() takes this matrix as an argument just fine, and I'm hoping to find a way to do the same for cor.test().

The common advice from other folks seems to be to use cor.prob():

https://stat.ethz.ch/pipermail/r-help/2001-November/016201.html

But these p-values are not the same as those generated by cor.test()!!! Cor.test() also seems better equipped to handle pairwise deletion (I have quite a bit of missing data in my data set) than cor.prob().

Does anybody have any alternatives to cor.prob()? If the solution involves nested for loops, so be it (I'm new enough to R for even this to be problematic for me).

Richard Erickson
  • 2,568
  • 8
  • 26
  • 39
Atticus29
  • 4,190
  • 18
  • 47
  • 84
  • You could use `lapply` with `cor.test` or vectorize the function and feed it to `outer` as seen in this link: http://stackoverflow.com/questions/9917242/create-a-matrix-from-a-function-and-two-numeric-data-frames – Tyler Rinker Oct 28 '12 at 19:34

5 Answers5

49

corr.test in the psych package is designed to do this:

library("psych")
data(sat.act)
corr.test(sat.act)

As noted in the comments, to replicate the p-values from the base cor.test() function over the entire matrix, then you need to turn off adjustment of the p-values for multiple comparisons (the default is to use Holm's method of adjustment):

 corr.test(sat.act, adjust = "none")

[But be careful when interpreting those results!]

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
Sacha Epskamp
  • 46,463
  • 20
  • 113
  • 131
  • 2
    beautiful, why reinvent the wheel. +1g – Tyler Rinker Oct 28 '12 at 23:40
  • 7
    Just a note if you want the results to match the stats `cor.test` use `corr.test(mtcars, adjust="none")` – Tyler Rinker Oct 28 '12 at 23:46
  • Tyler, I noticed that. Thanks! Both of you have been awesome and super helpful! – Atticus29 Oct 29 '12 at 02:51
  • 2
    If you have a big matrix, this will be very very slow! To speed it up, set the argument `ci=F` -- that takes about twice as long as cor() to run, whereas with `ci=T` (the default), it may take 100 times as long. – sssheridan Apr 29 '16 at 15:27
  • I got an error (Error in corr.test(x, y, adjust = "none", ci = F) : object 'sef' not found) when I tried to do "ci = F". I wrote an answer below that takes the important code from the function and just runs cor() and gives the pvalues. – Nick Clark Sep 14 '17 at 02:08
15

If you're strictly after the pvalues in a matrix format from cor.test here's a solution shamelessly stolen from Vincent (LINK):

cor.test.p <- function(x){
    FUN <- function(x, y) cor.test(x, y)[["p.value"]]
    z <- outer(
      colnames(x), 
      colnames(x), 
      Vectorize(function(i,j) FUN(x[,i], x[,j]))
    )
    dimnames(z) <- list(colnames(x), colnames(x))
    z
}

cor.test.p(mtcars)

Note: Tommy also provides a faster solution though less easy to impliment. Oh and no for loops :)

Edit I have a function v_outer in my qdapTools package that makes this task pretty easy:

library(qdapTools)
(out <- v_outer(mtcars, function(x, y) cor.test(x, y)[["p.value"]]))
print(out, digits=4)  # for more digits
Community
  • 1
  • 1
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519
  • Edited and `[[3]]` indexes the list that `cor.test` outputs. The third element of this list is the p.value. – Tyler Rinker Oct 28 '12 at 20:08
  • 1
    @TylerRinker I find that it's more clear in code if one uses the named version of the list output. It's a little more clear if instead of `cor.test(x, y)[[3]]` you have `cor.test(x, y)[["p.value"]]` that you're extracting the p-value from the test. – Dason Oct 29 '12 at 03:19
  • @Dason I agree I just was lazy in that I guessed what the index was based on the output and was too lazy too use `str` or `names` on the out put from cor.test to find out. I blame the bots really for this. They've automated our lives to the point where we're just too lazy. – Tyler Rinker Oct 29 '12 at 03:42
  • Are you saying that your proposal can reach the same result as `p.mat.all <- psych:::cor.test(M.cor, alternative = "two.sided", method = c("pearson", "kendall", "spearman"), adjust = "none", ci = F)`? - - I think you just use Pearson cor here. – Léo Léopold Hertz 준영 Nov 10 '16 at 14:15
  • I love this method, so thank you! I needed to compute p-vals for multiple pairwise correlations, and rcorr was not running in my data because it was made of very large vectors. This did the trick! Thanks!! – Rodrigo Duarte Jan 10 '20 at 14:37
7

Probably the easiest way is to use the rcorr() from Hmisc. It will only take a matrix, so use rcorr(as.matrix(x)) if your data is in a data.frame. It will return you a list with: 1) matrix of r pairwise, 2) matrix of pairwise n, 3) matrix of p values for the r's. It automatically ignores missing data.

Ideally, a function of this kind should take data.frames too and also output confidence intervals in line with the 'New Statistics'.

CoderGuy123
  • 6,219
  • 5
  • 59
  • 89
  • This is ideal, but it's not running on my large dataset (50 variables (that I am assessing their similarity) x 46,000,000 observations). Gives a memory error. – Rodrigo Duarte Jan 10 '20 at 00:07
  • Try `wtd.cors()` from *weights* package. I think it uses some kind of approximation that's fast. If you need the p values etc., you can use `wtd.cor()` on each pairwise variable. If you still want more speed, you could look into doing one variable at a time and saving the z scores between computations, as this would save the operation of recalculating them a bunch of times. – CoderGuy123 Jan 10 '20 at 17:42
5

The accepted solution (corr.test function in the psych package) works, but is extremely slow for large matrices. I was working with a gene expression matrix (~20,000 by ~1,000) correlated to a drug sensitivity matrix (~1,000 by ~500) and I had to stop it because it was taking forever.

I took some code from the psych package and used the cor() function directly instead and got much better results:

# find (pairwise complete) correlation matrix between two matrices x and y
# compare to corr.test(x, y, adjust = "none")
n <- t(!is.na(x)) %*% (!is.na(y)) # same as count.pairwise(x,y) from psych package
r <- cor(x, y, use = "pairwise.complete.obs") # MUCH MUCH faster than corr.test()
cor2pvalue = function(r, n) {
  t <- (r*sqrt(n-2))/sqrt(1-r^2)
  p <- 2*(1 - pt(abs(t),(n-2)))
  se <- sqrt((1-r*r)/(n-2))
  out <- list(r, n, t, p, se)
  names(out) <- c("r", "n", "t", "p", "se")
  return(out)
}
# get a list with matrices of correlation, pvalues, standard error, etc.
result = cor2pvalue(r,n)

Even with two 100 x 200 matrices, the difference was staggering. A second or two versus 45 seconds.

> system.time(test_func(x,y))
   user  system elapsed 
  0.308   2.452   0.130 
> system.time(corr.test(x,y, adjust = "none"))
   user  system elapsed 
 45.004   3.276  45.814 
Nick Clark
  • 131
  • 1
  • 2
  • Note: I just saw above that you can use corr.test() with "ci = F" option to make it faster. However, it gave me an error when I tried it. – Nick Clark Sep 14 '17 at 02:09
  • Looks like there is a small bug in the code. See my fix here (I know it's read-only): https://github.com/cran/psych/pull/2/commits/903484b426699f291ba158754188582b929572de I emailed the package maintainer about it. – Nick Clark Sep 14 '17 at 02:40
-1

"The accepted solution (corr.test function in the psych package) works, but is extremely slow for large matrices."

If you use ci=FALSE, then the speed is much faster. By default, confidence intervals are found. However, this leads to a slight slowdown of speed. So, for just the rs, ts and ps, set ci=FALSE.

kkica
  • 4,034
  • 1
  • 20
  • 40