
Question is in the title.

I'm working with dense matrices on the order of 10K-100K rows by 1K-10K columns, and I need PCA for dimensionality reduction. I typically want to capture ~95% of the variance, and in my data that takes roughly a third of the components -- so the data is not just dense, its redundancy is only moderate.

PCA with prcomp is painfully slow in higher dimensions and/or with larger N. I don't see an obvious way forward, however, because the source code for prcomp is really simple -- it basically just does housekeeping around a call to svd, and presumably svd has already been heavily optimized.
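
To illustrate what I mean by "housekeeping around a call to svd," here is a stripped-down sketch of roughly what prcomp does internally (the function name my_prcomp_sketch is just mine; error checking, rank handling, and the returned scores are omitted):

# Simplified sketch of prcomp's core, with the bookkeeping removed --
# essentially all of the time goes into svd():
my_prcomp_sketch <- function(X) {
    Xs <- scale(X, center = TRUE, scale = TRUE)   # the housekeeping
    s  <- svd(Xs, nu = 0)                         # the expensive part
    list(sdev = s$d / sqrt(nrow(Xs) - 1),         # std dev of each component
         rotation = s$v)                          # loadings
}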

I've tried the package irlba, but it's slower than prcomp for some reason -- perhaps having something to do with my dataset.
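
For reference, my call was roughly along these lines -- a sketch, not my exact invocation; I'm assuming irlba's prcomp_irlba wrapper with n set to about a third of the columns, since that's roughly where 95% of the variance lands for me:

library(irlba)

# Truncated PCA: compute only the leading n components instead of the full SVD.
# (Sketch only; X here is a stand-in for my real data.)
X <- matrix(rnorm(2000 * 200), 2000, 200)
pca_trunc <- prcomp_irlba(X, n = 70, center = TRUE, scale. = TRUE)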

Just for fun, here's a simple example:

library(MASS)
N <- 10
p <- 2
out <- c()
haventHitControlCYet <- TRUE
i <- 0
while (haventHitControlCYet) {
    i <- i + 1
    N <- N * 2                             # double the number of rows...
    p <- p * 2                             # ...and the number of columns each iteration
    C <- matrix(rnorm(p^2), p)
    covmat <- crossprod(C)                 # random positive semi-definite covariance
    X <- mvrnorm(N, rep(0, p), covmat)     # N x p draws from a multivariate normal
    PT <- proc.time()
    pca <- prcomp(X, center = TRUE, scale. = TRUE)
    out[i] <- (proc.time() - PT)[3]        # elapsed seconds for the prcomp call
    plot(out, type = "b")                  # timing curve so far
}

Compute time way more than doubles each time N and p double.

[Plot: prcomp elapsed time (seconds) vs. iteration]

I'm asking this question because googling turns up packages for specific sorts of data (genomics, etc.), and I'm looking for a general way to speed up the computation of PCA for dense matrices where N > p.
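
To be concrete about "general": even something as plain as the textbook eigen-decomposition route for N > p would count, if it were actually faster in practice. A rough sketch of that equivalence (same components as prcomp up to the sign of the loadings; the helper name pca_via_eigen is just for illustration, and I haven't benchmarked it):

# Sketch: eigen-decompose the p x p correlation matrix instead of running
# svd() on the full N x p matrix. For scaled X, crossprod(X)/(N-1) is the
# correlation matrix, so its eigenvalues are the component variances.
pca_via_eigen <- function(X) {
    Xs <- scale(X, center = TRUE, scale = TRUE)
    R  <- crossprod(Xs) / (nrow(Xs) - 1)    # p x p correlation matrix
    e  <- eigen(R, symmetric = TRUE)
    list(sdev = sqrt(pmax(e$values, 0)),    # clamp tiny negative eigenvalues
         rotation = e$vectors)
}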

I'm surprised there isn't already a thread on this, as it seems like it would be a common question -- unless, of course, there is a hard theoretical limit on speed that has already been reached(?)

generic_user

0 Answers