Efficient way of row/column sum of a IndexedRowmatrix in Apache Spark

Question

I have a matrix in a CoordinateMatrix format in Scala. The Matrix is sparse and the entires look like (upon coo_matrix.entries.collect),

Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(
  MatrixEntry(0,0,-1.0), MatrixEntry(0,1,-1.0), MatrixEntry(1,0,-1.0),
  MatrixEntry(1,1,-1.0), MatrixEntry(1,2,-1.0), MatrixEntry(2,1,-1.0), 
  MatrixEntry(2,2,-1.0), MatrixEntry(0,3,-1.0), MatrixEntry(0,4,-1.0), 
  MatrixEntry(0,5,-1.0), MatrixEntry(3,0,-1.0), MatrixEntry(4,0,-1.0), 
  MatrixEntry(3,3,-1.0), MatrixEntry(3,4,-1.0), MatrixEntry(4,3,-1.0),
  MatrixEntry(4,4,-1.0))

This is only a small sample size. The Matrix is of size a N x N (where N = 1 million) though a majority of it is sparse. What is one of the efficient way of getting row sums of this matrix in Spark Scala? The goal is to create a new RDD composed of row sums i.e. of size N where 1st element is row sum of row1 and so on ..

I can always convert this coordinateMatrix to IndexedRowMatrix and run a for loop to compute rowsums one iteration at a time, but it is not the most efficient approach.

any idea is greatly appreciated.

zero323 · Answer 1 · 2015-10-24T08:54:34.613

It will be quite expensive due to shuffling (this is the part you cannot really avoid here) but you can convert entries to PairRDD and reduce by key:

import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD

val mat: CoordinateMatrix = ???
val rowSums: RDD[Long, Double)] = mat.entries
  .map{case MatrixEntry(row, _, value) => (row, value)}
  .reduceByKey(_ + _)

Unlike solution based on indexedRowMatrix:

import org.apache.spark.mllib.linalg.distributed.IndexedRow

mat.toIndexedRowMatrix.rows.map{
  case IndexedRow(i, values) => (i, values.toArray.sum)
}

it requires no groupBy transformation or intermediate SparseVectors.

Efficient way of row/column sum of a IndexedRowmatrix in Apache Spark

1 Answers1