I have a large matrix (1,000,000 rows by 1,140 columns) which I'm storing using the ff package.

Is there an efficient way to calculate a covariance matrix from this? Using the cov function gives the error:

Error in cov(X) : supply both 'x' and 'y' or a matrix-like 'x'

This is not surprising, given that cov doesn't understand ff objects. I'm currently using a simple nested for loop:

covarianceMatrix <- matrix(0, nrow = ncol(ffObject), ncol = ncol(ffObject))
for (i in 1:ncol(ffObject)) {
  for (j in i:ncol(ffObject)) {
    if (i == j) {
      # Diagonal entries are the column variances
      covarianceMatrix[i, j] <- var(ffObject[, i])
    } else {
      # Off-diagonal entries: compute once, fill both halves by symmetry
      covarianceMatrix[i, j] <- covarianceMatrix[j, i] <- cov(ffObject[, i], ffObject[, j])
    }
  }
}

which works but is very slow.

1 Answer

I found a solution based on the answer to this question: https://scicomp.stackexchange.com/questions/5464/parallel-computation-of-big-covariance-matrices, combined with some code from the bootSVD package, available here: https://github.com/aaronjfisher/bootSVD/blob/master/R/bootstrap_functions.R. The idea is to accumulate the cross-product matrix t(X) %*% X chunk by chunk over the rows, compute the column sums, and then assemble the covariance matrix from those two quantities. Specifically:

covarianceMatrix <- matrix(0, nrow = ncol(ffObject), ncol = ncol(ffObject))
# Accumulate t(X) %*% X chunk by chunk; ffapply binds i1 and i2 to each chunk's row range
ffapply({ covarianceMatrix <- covarianceMatrix + crossprod(ffObject[i1:i2, ]) }, X = ffObject, MARGIN = 1)
# Column sums, needed to center the cross-products
columnSums <- sapply(1:ncol(ffObject), function(i) sum(ffObject[, i]))

# Population covariance: t(X) %*% X / n minus the outer product of the column means
n <- nrow(ffObject)
covarianceMatrix <- covarianceMatrix/n - (columnSums %*% t(columnSums))/n/n
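Note that this formula uses the population (1/n) normalization, whereas R's cov() divides by n - 1. If you want to match cov() exactly, rescale the result:

# Rescale from the population (1/n) to the unbiased (1/(n-1)) estimate, as cov() returns
covarianceMatrix <- covarianceMatrix * n/(n - 1)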

This runs substantially faster than the code in the question, a matter of minutes rather than hours.
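As a quick sanity check, here is a minimal sketch (using a small in-memory matrix in place of the ff object, with illustrative names X, cs, onePass) verifying that the one-pass formula matches cov() after rescaling:

X <- matrix(rnorm(50 * 4), nrow = 50)
n <- nrow(X)
cs <- colSums(X)
onePass <- crossprod(X)/n - (cs %*% t(cs))/n/n  # population covariance
all.equal(onePass * n/(n - 1), cov(X))          # should be TRUE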
