
Question is in the title.

I'm working with dense matrices on the order of 10K-100K rows by 1K-10K columns, and I need PCA for dimensionality reduction. I typically want to capture ~95% of the variance, and in my data that takes roughly a third of the components -- so the data is not just dense, its redundancy is only moderate.

PCA with prcomp is painfully slow in higher dimensions and/or with larger N. I don't see an obvious way forward, however, because the source code for prcomp is really simple -- it basically just does housekeeping around a call to svd, and presumably svd has already been heavily optimized.
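
To illustrate what I mean by "housekeeping around a call to svd," here is a stripped-down sketch of roughly what prcomp does internally (the function name my_prcomp_sketch is just mine; error checking, rank handling, and the returned scores are omitted):

# Simplified sketch of prcomp's core, with the bookkeeping removed --
# essentially all of the time goes into svd():
my_prcomp_sketch <- function(X) {
    Xs <- scale(X, center = TRUE, scale = TRUE)   # the housekeeping
    s  <- svd(Xs, nu = 0)                         # the expensive part
    list(sdev = s$d / sqrt(nrow(Xs) - 1),         # std dev of each component
         rotation = s$v)                          # loadings
}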

I've tried the package irlba, but it's slower than prcomp for some reason -- perhaps having something to do with my dataset.
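
For reference, my call was roughly along these lines -- a sketch, not my exact invocation; I'm assuming irlba's prcomp_irlba wrapper with n set to about a third of the columns, since that's roughly where 95% of the variance lands for me:

library(irlba)

# Truncated PCA: compute only the leading n components instead of the full SVD.
# (Sketch only; X here is a stand-in for my real data.)
X <- matrix(rnorm(2000 * 200), 2000, 200)
pca_trunc <- prcomp_irlba(X, n = 70, center = TRUE, scale. = TRUE)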

Just for fun, here's a simple example:

library(MASS)
N <- 10
p <- 2
out <- c()
haventHitControlCYet <- TRUE
i <- 0
while (haventHitControlCYet) {
    i <- i + 1
    N <- N * 2                             # double the number of rows...
    p <- p * 2                             # ...and the number of columns each iteration
    C <- matrix(rnorm(p^2), p)
    covmat <- crossprod(C)                 # random positive semi-definite covariance
    X <- mvrnorm(N, rep(0, p), covmat)     # N x p draws from a multivariate normal
    PT <- proc.time()
    pca <- prcomp(X, center = TRUE, scale. = TRUE)
    out[i] <- (proc.time() - PT)[3]        # elapsed seconds for the prcomp call
    plot(out, type = "b")                  # timing curve so far
}

Compute time way more than doubles each time N and p double.

[Plot: prcomp elapsed time (seconds) vs. iteration]

I'm asking this question because googling turns up packages for specific sorts of data (genomics, etc.), and I'm looking for a general way to speed up the computation of PCA for dense matrices where N > p.
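
To be concrete about "general": even something as plain as the textbook eigen-decomposition route for N > p would count, if it were actually faster in practice. A rough sketch of that equivalence (same components as prcomp up to the sign of the loadings; the helper name pca_via_eigen is just for illustration, and I haven't benchmarked it):

# Sketch: eigen-decompose the p x p correlation matrix instead of running
# svd() on the full N x p matrix. For scaled X, crossprod(X)/(N-1) is the
# correlation matrix, so its eigenvalues are the component variances.
pca_via_eigen <- function(X) {
    Xs <- scale(X, center = TRUE, scale = TRUE)
    R  <- crossprod(Xs) / (nrow(Xs) - 1)    # p x p correlation matrix
    e  <- eigen(R, symmetric = TRUE)
    list(sdev = sqrt(pmax(e$values, 0)),    # clamp tiny negative eigenvalues
         rotation = e$vectors)
}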

I'm surprised there isn't already a thread on this, as it seems like it would be a common question -- unless, of course, there is a hard theoretical limit on speed that has already been reached(?)

generic_user

0 Answers