Solving a large-scale linear system in Apache Spark

Question

I am looking to solve a large-scale linear system, Ax=b, using Spark. I have searched extensively, and this link is the only solution I have found: it calculates the pseudo-inverse of A, which can then be multiplied by b as the next step. For simplicity I will copy the solution here.

import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,SingularValueDecomposition,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def computeInverse(X: RowMatrix): DenseMatrix = {
  val nCoef = X.numCols.toInt
  val svd = X.computeSVD(nCoef, computeU = true)
  if (svd.s.size < nCoef) {
    sys.error(s"RowMatrix.computeInverse called on singular matrix.")
  }

  // Create the inv diagonal matrix from S 
  val invS = DenseMatrix.diag(new DenseVector(svd.s.toArray.map(x => math.pow(x,-1))))

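  // U is collected into a local matrix (it cannot stay a RowMatrix for the multiply below).
  // The rows are flattened in row-major order while DenseMatrix is column-major, so declaring
  // the dimensions as (numCols, numRows) yields transpose(U) directly, also for non-square X.
  val U = new DenseMatrix(svd.U.numCols().toInt, svd.U.numRows().toInt, svd.U.rows.collect.flatMap(x => x.toArray))

  // If you could make V distributed, then this may be better. However it's already local...so maybe this is fine.
  val V = svd.V
  // inv(X) = V*inv(S)*transpose(U)  --- the U is already transposed.
  (V.multiply(invS)).multiply(U)
}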

However, the problem with this solution is that in the end we have to make U a local DenseMatrix, which I think will not be possible for large matrices, since U has as many rows as A. I would appreciate any help or thoughts on solving this problem.

Graham S · Answer 1 · 2016-09-13T12:56:23.937

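You could try one of the iterative algorithms, e.g. preconditioned conjugate gradient (PCG). Instead of solving Ax=b directly, you search for the x that minimizes f(x) = 0.5 x^T A x - x^T b, which has the same solution when A is symmetric positive definite.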
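
To see why minimizing f gives the solution of the linear system, here is a short derivation. It assumes A is symmetric positive definite, which is the setting conjugate gradient methods are designed for:

$$
f(x) = \tfrac{1}{2}\, x^{T} A x - x^{T} b,
\qquad
\nabla f(x) = \tfrac{1}{2}\,(A + A^{T})\,x - b = A x - b .
$$

Setting the gradient to zero gives Ax = b, and because the Hessian of f is A, which is positive definite, that stationary point is the unique minimizer. If A is not symmetric positive definite but has full column rank, the same argument can be applied to the normal equations A^T A x = A^T b, which is the least-squares problem that solvers such as LSQR target.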

With parallel PCG, the iterations themselves run serially; it is the matrix-vector multiplication and the other vector operations within each iteration that are shared among your workers. This lets you keep your sparse matrix distributed across your cluster.
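
To make that split concrete, here is a minimal sketch of plain conjugate gradient (without the preconditioner, so not full PCG) in Scala. It is not Spark code and not an existing library API: `ConjugateGradientSketch`, `solve`, and the `matVec` hook are hypothetical names introduced for illustration, and the sketch assumes A is symmetric positive definite. The loop runs serially on the driver, and the only step that touches A is the call to `matVec`, which is the part one would distribute.

```scala
// Plain conjugate gradient for a symmetric positive definite system A x = b.
// All names here are hypothetical; `matVec` must return A * v for any vector v.
object ConjugateGradientSketch {
  type Vec = Array[Double]

  private def dot(u: Vec, v: Vec): Double = {
    var s = 0.0
    var i = 0
    while (i < u.length) { s += u(i) * v(i); i += 1 }
    s
  }

  // Returns u + alpha * v as a new array.
  private def axpy(alpha: Double, v: Vec, u: Vec): Vec =
    Array.tabulate(u.length)(i => u(i) + alpha * v(i))

  def solve(matVec: Vec => Vec, b: Vec, tol: Double = 1e-10, maxIter: Int = 1000): Vec = {
    var x: Vec = Array.fill(b.length)(0.0) // initial guess x = 0
    var r: Vec = b.clone()                 // residual b - A*x, equal to b for x = 0
    var p: Vec = r.clone()                 // first search direction
    var rsOld = dot(r, r)
    var iter = 0
    while (iter < maxIter && math.sqrt(rsOld) > tol) {
      val ap = matVec(p)                   // the only operation that needs A
      val alpha = rsOld / dot(p, ap)       // step length along p
      x = axpy(alpha, p, x)                // x = x + alpha * p
      r = axpy(-alpha, ap, r)              // r = r - alpha * A p
      val rsNew = dot(r, r)
      p = axpy(rsNew / rsOld, p, r)        // p = r + beta * p
      rsOld = rsNew
      iter += 1
    }
    x
  }
}
```

With a local 2x2 matrix, `matVec` can simply be `v => a.map(row => row.zip(v).map { case (p, q) => p * q }.sum)`. On Spark, one option would be to broadcast `v` and map over a distributed collection of rows to compute each dot product, but that wiring depends on how A is stored and is left out here.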

Unfortunately, Spark's linear algebra library is a work in progress and I don't have example code to show you. There are probably better methods than PCG for your problem; they just need to be implemented in Spark. I'm not sure what your background is, but you could start by researching how systems of linear equations are solved in parallel in general.

Edit: There's some more discussion here and here.

I found a python implementation of the LSQR algorithm [here](https://github.com/chocjy/randomized-LS-solvers/tree/master/src). — Graham S, Oct 11 '16 at 15:12
