I have a matrix in a CoordinateMatrix format in Scala. The Matrix is sparse and the entires look like (upon coo_matrix.entries.collect),
Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(
MatrixEntry(0,0,-1.0), MatrixEntry(0,1,-1.0), MatrixEntry(1,0,-1.0),
MatrixEntry(1,1,-1.0), MatrixEntry(1,2,-1.0), MatrixEntry(2,1,-1.0),
MatrixEntry(2,2,-1.0), MatrixEntry(0,3,-1.0), MatrixEntry(0,4,-1.0),
MatrixEntry(0,5,-1.0), MatrixEntry(3,0,-1.0), MatrixEntry(4,0,-1.0),
MatrixEntry(3,3,-1.0), MatrixEntry(3,4,-1.0), MatrixEntry(4,3,-1.0),
MatrixEntry(4,4,-1.0))
This is only a small sample size. The Matrix is of size a N x N (where N = 1 million) though a majority of it is sparse. What is one of the efficient way of getting row sums of this matrix in Spark Scala? The goal is to create a new RDD composed of row sums i.e. of size N where 1st element is row sum of row1 and so on ..
I can always convert this coordinateMatrix to IndexedRowMatrix and run a for loop to compute rowsums one iteration at a time, but it is not the most efficient approach.
any idea is greatly appreciated.