Question is in the title.
I'm working with dense matrices on the order of 10K-100K x 1K-10K, and need PCA to do dimensionality reduction. I'll typically want to capture ~95% of the variance. In my data I typically get this with about 1/3 of the components -- the data is not just dense; the level of redundancy is only moderate.
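For concreteness, here is roughly how I decide how many components to keep (a small random matrix standing in for my real data):
X <- matrix(rnorm(2000 * 50), 2000, 50)          # toy stand-in for my real data
pca <- prcomp(X, center = TRUE, scale. = TRUE)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative proportion of variance explained
k <- which(cumvar >= 0.95)[1]                    # number of components for ~95%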
PCA with prcomp is painfully slow in higher dimensions and/or with larger N. I don't see an obvious way forward, however, as the source code for prcomp is really simple -- it basically just does housekeeping around a call to svd. And presumably svd has been heavily optimized.
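To illustrate, here's a minimal sketch of that svd-based route, which as far as I can tell reproduces what prcomp returns (toy data again):
X  <- matrix(rnorm(2000 * 50), 2000, 50)     # toy stand-in
Xs <- scale(X, center = TRUE, scale = TRUE)  # same preprocessing as prcomp(center = T, scale. = T)
s  <- svd(Xs)
sdevs    <- s$d / sqrt(nrow(Xs) - 1)         # should match prcomp(X, ...)$sdev
loadings <- s$v                              # should match $rotation (up to sign flips)
scores   <- Xs %*% s$v                       # should match $x (up to sign flips)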
I've tried the package irlba, but it's slower than prcomp for some reason -- perhaps having something to do with my dataset.
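If it's relevant, the call I'd expect to use is something along these lines (assuming I have irlba's prcomp_irlba interface right, and asking only for the leading components):
library(irlba)
X <- matrix(rnorm(2000 * 50), 2000, 50)  # toy stand-in
k <- 15                                  # roughly p/3 components, per the 95% target above
pca_trunc <- prcomp_irlba(X, n = k, center = TRUE, scale. = TRUE)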
Just for fun, here's a simple example:
library(MASS)

N <- 10
p <- 2
out <- c()
haventHitControlCYet <- TRUE
i <- 0
while (haventHitControlCYet == TRUE) {
  i <- i + 1
  N <- N * 2                            # double the number of rows each pass
  p <- p * 2                            # double the number of columns each pass
  C <- matrix(rnorm(p^2), p)
  covmat <- crossprod(C)                # random positive-definite covariance
  X <- mvrnorm(N, rep(0, p), covmat)    # simulate an N x p Gaussian matrix
  PT <- proc.time()
  pca <- prcomp(X, center = TRUE, scale. = TRUE)
  out[i] <- (proc.time() - PT)[3]       # elapsed time for prcomp alone
  plot(out, type = "b")
}
Compute time far more than doubles each time the size of the data doubles.
Asking this question because googling turns up packages for specific sorts of data (genomics, etc.), and I'm looking for a general way to speed up computation of PCA for dense matrices where N>p.
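To make "general" a bit more concrete, the sort of thing I have in mind is exploiting N > p by eigendecomposing the p x p crossproduct instead of taking the SVD of the full N x p matrix -- though I don't know whether this is actually faster or numerically sound in practice. A sketch, again with toy data:
X   <- matrix(rnorm(2000 * 50), 2000, 50)     # toy stand-in, N > p
Xs  <- scale(X, center = TRUE, scale = TRUE)  # same preprocessing as prcomp(center = T, scale. = T)
S   <- crossprod(Xs) / (nrow(Xs) - 1)         # p x p correlation matrix
eig <- eigen(S, symmetric = TRUE)
sdevs    <- sqrt(pmax(eig$values, 0))         # should match prcomp(X, ...)$sdev
loadings <- eig$vectors                       # should match $rotation (up to sign flips)
scores   <- Xs %*% loadings                   # should match $x (up to sign flips)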
I'm surprised that there isn't already a thread on this, as it seems like it would be a common question. Unless, of course, there is a hard theoretical limit on speed that has already been reached(?)