How are PCA and SVD distributed in libraries like MLlib or Mahout

Question

I know techniques for dimensionality reduction like PCA or SVD.

I would like to know how these techniques are implemented in distributed Big Data platforms like Apache Spark.

Is there pseudocode or a schematic of the formulation available? I would like to know which parts of the algorithm could become bottlenecks because of communication costs.

Thank you very much in advance

Check out http://stackoverflow.com/questions/40262539/pca-in-spark-mllib-and-spark-ml/40268082#40268082 — cangrejo, Dec 06 '16 at 18:01
My question is partially answered there: they explain PCA. They parallelize the computation of A'A, and then the master node computes the eigenvalues with no parallelization. In SVD you decompose your matrix A into three matrices, A=USV'. I understand that the procedure to obtain S and V should be parallelized in the same way as for PCA, but what about U? — Rob, Dec 06 '16 at 21:33
If you need U you can obtain it by computing the product AVS^(-1). — cangrejo, Dec 06 '16 at 21:48
Then I suppose that they broadcast V and S^(-1) and the product is performed in every partition of A. Thanks. — Rob, Dec 06 '16 at 23:31
In Mahout there is also dssvd (distributed stochastic singular value decomposition). It returns both U and V as DRMs (distributed row matrices, which in the Spark case are wrappers around RDDs): [link](https://mahout.apache.org/users/algorithms/d-ssvd.html) — rawkintrevo, Dec 19 '16 at 03:44
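
The scheme discussed in the comments above (sum A'A across partitions, eigendecompose on the master, then broadcast V and S^(-1) to get U) can be sketched on a single machine with NumPy. This is only an illustration under stated assumptions, not MLlib's or Mahout's code: the list `partitions` stands in for the row blocks of a distributed matrix, and the names `local_gram` and `gram_svd` are hypothetical.

```python
import numpy as np

def local_gram(block):
    # "Map" step: each partition computes its own n x n contribution to A'A.
    return block.T @ block

def gram_svd(partitions, k):
    # "Reduce" step: sum the n x n partial Gram matrices. In a cluster this is
    # where communication happens: every partition ships n*n numbers.
    gram = sum(local_gram(block) for block in partitions)

    # "Driver" step: eigendecomposition of the small n x n matrix, done
    # locally with no parallelization.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    s = np.sqrt(np.clip(eigvals[order], 0.0, None))
    v = eigvecs[:, order]

    # "Broadcast" step: V S^(-1) is n x k and is sent to every partition,
    # which then computes its own rows of U = A V S^(-1).
    # Assumes the k retained singular values are nonzero.
    vs_inv = v / s                                  # divides column j by s[j]
    u_parts = [block @ vs_inv for block in partitions]
    return u_parts, s, v
```

Under these assumptions the communication cost depends on the number of columns n (the n x n reduce and the n x k broadcast), not on the number of rows, and the eigendecomposition on the master is the serial part.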

rawkintrevo · Accepted Answer · 2016-12-20T20:13:49.163

Apache Mahout implements Distributed Stochastic Singular Value Decomposition, which is directly based on "Randomized methods for computing low-rank approximations of matrices" by Nathan Halko.

Note that dssvd is part of Apache Mahout Samsara, a library that runs on top of Spark. So in essence this is a Spark-based approach to SVD that is in fact distributed.
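
To give a feel for the randomized approach, here is a minimal single-machine NumPy sketch of the basic random-projection scheme. It is not Mahout's dssvd implementation; the function name `randomized_svd` and the parameters `oversample` and `seed` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def randomized_svd(a, k, oversample=5, seed=0):
    rng = np.random.default_rng(seed)
    n = a.shape[1]

    # Random test matrix: n x (k + oversample), small enough to broadcast.
    omega = rng.standard_normal((n, k + oversample))

    # Y = A * Omega can be computed row block by row block (a pure map step).
    y = a @ omega

    # Orthonormal basis for the range of Y. For a distributed tall-skinny Y
    # this step would need a distributed QR; here it is done locally.
    q, _ = np.linalg.qr(y)

    # B = Q' A is (k + oversample) x n, a sum of per-block products.
    b = q.T @ a

    # Exact SVD of the small matrix B, done locally.
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)

    # Lift the left singular vectors back to the original row space.
    u = q @ u_small
    return u[:, :k], s[:k], vt[:k, :]
```

In this sketch the large matrix is only touched in the products `a @ omega` and `q.T @ a`; everything else works on matrices with `k + oversample` rows or columns.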

With regard to distributed PCA, Mahout also exposes distributed stochastic PCA. There has been some website shuffling recently, but dspca (distributed stochastic principal component analysis) is given as an example here, which gives the algorithm and implementation.

I believe Halko (see the reference above) also discusses distributed PCA. I can't tell you where the bottlenecks would be, but I hope this information gets you started in your research.
