
I know techniques for dimensionality reduction like PCA or SVD.

I would like to know how these techniques are implemented in distributed Big Data platforms like Apache Spark.

Is there pseudocode or a schematic of the formulation available? I would also like to know which parts of the algorithm could become bottlenecks because of communication costs.

Thank you very much in advance

Rob
  • Check out http://stackoverflow.com/questions/40262539/pca-in-spark-mllib-and-spark-ml/40268082#40268082 – cangrejo Dec 06 '16 at 18:01
  • My question is partially answered there; they explain PCA. They parallelize the computation of A'A, and then the master node computes the eigenvalues without parallelization. In SVD you decompose the matrix A into three matrices, A = USV'. I understand that the procedure to obtain S and V should be parallelized in the same way as for PCA, but what about U? – Rob Dec 06 '16 at 21:33
  • If you need U, you can obtain it by computing the product AVS^(-1). – cangrejo Dec 06 '16 at 21:48
  • Then I suppose that they broadcast V and S^(-1) and perform the product on every partition of A. Thanks. – Rob Dec 06 '16 at 23:31
  • In Mahout there is also dssvd (distributed stochastic singular value decomposition). It returns both U and V as DRMs (distributed row matrices, which in the Spark case are wrappers around RDDs) [link](https://mahout.apache.org/users/algorithms/d-ssvd.html) – rawkintrevo Dec 19 '16 at 03:44
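
The scheme discussed in these comments (sum per-partition pieces of A'A, eigen-decompose on the master, then broadcast V and S^(-1) to get U) can be sketched in plain Python. This is a minimal illustration and not Spark's actual code: partitions are simulated as lists of rows, every function name is hypothetical, and power iteration with deflation stands in for whatever eigen-solver a real driver would call. It also assumes k does not exceed the rank of A.

```python
# Sketch only: partitions are simulated as plain Python lists of rows.
# All function names are hypothetical, not Spark or MLlib APIs.

def local_gram(rows, n):
    """Per-partition work: accumulate A_p' A_p, an n x n matrix."""
    g = [[0.0] * n for _ in range(n)]
    for r in rows:
        for i in range(n):
            for j in range(n):
                g[i][j] += r[i] * r[j]
    return g

def add_matrices(x, y):
    """Reduce step: each partition contributes one n x n matrix to the sum."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def top_eigenpairs(g, k, iters=500):
    """Driver-side, not distributed: power iteration with deflation."""
    n = len(g)
    g = [row[:] for row in g]  # work on a copy
    pairs = []
    for c in range(k):
        v = [1.0 + 0.1 * (i + c) for i in range(n)]  # fixed start vector
        for _ in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(g[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]  # deflate the component just found
    return pairs

def distributed_svd(partitions, n, k):
    # 1. Map + reduce: sum the per-partition Gram matrices into A'A.
    gram = local_gram(partitions[0], n)
    for p in partitions[1:]:
        gram = add_matrices(gram, local_gram(p, n))
    # 2. Driver: eigen-decompose A'A; singular values are sqrt(eigenvalues).
    pairs = top_eigenpairs(gram, k)
    s = [lam ** 0.5 for lam, _ in pairs]
    v = [vec for _, vec in pairs]  # v[c] is the c-th right singular vector
    # 3. "Broadcast" V and S^(-1); each partition computes its rows of U = A V S^(-1).
    u_parts = [
        [[sum(r[j] * v[c][j] for j in range(n)) / s[c] for c in range(k)] for r in rows]
        for rows in partitions
    ]
    return u_parts, s, v

# Usage: a 4 x 2 matrix split into two partitions of two rows each.
parts = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
u_parts, s, v = distributed_svd(parts, n=2, k=2)
```

In this sketch the only data that crosses partition boundaries is one n x n matrix per partition in step 1 and the n x k matrix V (plus k singular values) in step 3, so the communication volume depends on the number of columns n rather than the number of rows. Step 2 runs entirely on one node.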

1 Answer


Apache Mahout implements Distributed Stochastic Singular Value Decomposition (dssvd), which is directly based on "Randomized methods for computing low-rank approximations of matrices" by Nathan Halko.

Note that dssvd is part of Apache Mahout Samsara, a library that runs on top of Spark. So in essence this is a Spark-based approach to SVD that is in fact distributed.
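
For orientation, the basic randomized scheme from that line of work can be outlined as below for an m x n matrix A. This is the generic form with a target rank k and an oversampling parameter p (both symbols are assumptions introduced here), and Mahout's dssvd may differ in its details.

```latex
\begin{aligned}
&\text{1. Draw a Gaussian test matrix } \Omega \in \mathbb{R}^{n \times (k+p)} \\
&\text{2. Sample the range: } Y = A\,\Omega \\
&\text{3. Orthonormalize: } Y = QR \\
&\text{4. Project: } B = Q^{\top} A \\
&\text{5. Small SVD: } B = \hat{U}\,\Sigma\,V^{\top} \\
&\text{6. Recover: } U = Q\,\hat{U}, \quad A \approx U\,\Sigma\,V^{\top}
\end{aligned}
```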

With regard to distributed PCA, Mahout also exposes distributed stochastic PCA. There has been some website shuffling recently, but dspca (distributed stochastic principal component analysis) is given as an example here, which gives the algorithm and implementation.

I believe Halko (see the reference above) also discusses distributed PCA. I can't tell you where the bottlenecks would be, but I hope this information gets you started in your research.

rawkintrevo