This isn't a complete/working solution, but will give you an idea of some of the issues.
Your correlation matrix will contain n*(n-1)/2
= 1535771331 unique elements. If each correlation takes one millisecond to compute, computing the correlation matrix will take (n^2-n)/2/(1e6*3600)
= 0.42 hours and require (n^2-n)/2*8/(2^30)
= 11.4 Gb of storage. These requirements are not impossible if you have a lot of RAM and time ...
In fact it's a little bit worse than this, since rcorr
returns its results as a symmetric matrix (i.e., not taking advantage of the symmetry), and returns the n
and P
matrices as well, so the storage requirement will be approximately 5 times as great as stated above (double for the full matrix, x 2.5 because we have two double-precision and one integer matrix).
Getting to your specific question, the section on long vectors in the R internals manual discusses the maximum sizes of objects in R. The 'standard' limitation is that the total number of elements of the matrix should be less than 2^31 ((n^2-n)/2/(2^31-1)
= 0.72), but the redundancy in the matrix gets you in trouble (as would the storage of the correlation, p-values, and the sample sizes).
If you still want to go ahead, here is an implementation by A.N. Spiess, copied from here, that breaks the problem into blocks and stores the results in a disk-backed array (i.e., not in RAM). This won't get you the p-values (and it's still not clear what you're going to do with all those values ...), but it works at least up to 40,000 columns (takes about a minute).
However, it seems to crap out on your actual problem size (888 x 55242) with a too-large length. I'd have to look more closely and see if there is a limitation here we can get around ... It seems that we are actually still limited by the matrix dimensions ... (maximum matrix dimension is sqrt(2^31-1)
approx. 46341 ... With more work, we could still do the block-diagonal thing and split this into several components ...
set.seed(101)
nc <- 55422
nr <- 888
d <- matrix(rnorm(nr*nc), ncol = nc)
t1 <- system.time(b1 <- bigcor(d))
bigcor <- function(
x,
y = NULL,
fun = c("cor", "cov"),
size = 2000,
verbose = TRUE,
...)
{
if (!require("ff")) stop("please install the ff package")
fun <- match.arg(fun)
if (fun == "cor") FUN <- cor else FUN <- cov
if (fun == "cor") STR <- "Correlation" else STR <- "Covariance"
if (!is.null(y) & NROW(x) != NROW(y)) stop("'x' and 'y' must have compatible dimensions!")
NCOL <- ncol(x)
if (!is.null(y)) YCOL <- NCOL(y)
## calculate remainder, largest 'size'-divisible integer and block size
REST <- NCOL %% size
LARGE <- NCOL - REST
NBLOCKS <- NCOL %/% size
## preallocate square matrix of dimension
## ncol(x) in 'ff' single format
if (is.null(y)) resMAT <- ff(vmode = "double", dim = c(NCOL, NCOL))
else resMAT <- ff(vmode = "double", dim = c(NCOL, YCOL))
## split column numbers into 'nblocks' groups + remaining block
GROUP <- rep(1:NBLOCKS, each = size)
if (REST > 0) GROUP <- c(GROUP, rep(NBLOCKS + 1, REST))
SPLIT <- split(1:NCOL, GROUP)
## create all unique combinations of blocks
COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
COMBS <- t(apply(COMBS, 1, sort))
COMBS <- unique(COMBS)
if (!is.null(y)) COMBS <- cbind(1:length(SPLIT), rep(1, length(SPLIT)))
## initiate time counter
timeINIT <- proc.time()
## iterate through each block combination, calculate correlation matrix
## between blocks and store them in the preallocated matrix on both
## symmetric sides of the diagonal
for (i in 1:nrow(COMBS)) {
COMB <- COMBS[i, ]
G1 <- SPLIT[[COMB[1]]]
G2 <- SPLIT[[COMB[2]]]
## if y = NULL
if (is.null(y)) {
if (verbose) cat(sprintf("#%d: %s of Block %s and Block %s (%s x %s) ... ", i, STR, COMB[1],
COMB[2], length(G1), length(G2)))
RES <- FUN(x[, G1], x[, G2], ...)
resMAT[G1, G2] <- RES
resMAT[G2, G1] <- t(RES)
} else ## if y = smaller matrix or vector
{
if (verbose) cat(sprintf("#%d: %s of Block %s and 'y' (%s x %s) ... ", i, STR, COMB[1],
length(G1), YCOL))
RES <- FUN(x[, G1], y, ...)
resMAT[G1, ] <- RES
}
if (verbose) {
timeNOW <- proc.time() - timeINIT
cat(timeNOW[3], "s\n")
}
gc()
}
return(resMAT)
}