22

I am trying to perform matrix multiplication using Apache Spark and Java.

I have 2 main questions:

  1. How to create RDD that can represent matrix in Apache Spark?
  2. How to multiply two such RDDs?
Community
  • 1
  • 1
Jigar
  • 468
  • 1
  • 5
  • 15

1 Answers1

55

All depends on the input data and dimensions but generally speaking what you want is not a RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At this moment it provides four different implementations of the DistributedMatrix

  • IndexedRowMatrix - can be created directly from a RDD[IndexedRow] where IndexedRow consist of row index and org.apache.spark.mllib.linalg.Vector

    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix,
      IndexedRow}
    
    val rows =  sc.parallelize(Seq(
      (0L, Array(1.0, 0.0, 0.0)),
      (0L, Array(0.0, 1.0, 0.0)),
      (0L, Array(0.0, 0.0, 1.0)))
    ).map{case (i, xs) => IndexedRow(i, Vectors.dense(xs))}
    
    val indexedRowMatrix = new IndexedRowMatrix(rows)
    
  • RowMatrix - similar to IndexedRowMatrix but without meaningful row indices. Can be created directly from RDD[org.apache.spark.mllib.linalg.Vector]

    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    
    val rowMatrix = new RowMatrix(rows.map(_.vector))      
    
  • BlockMatrix - can be created from RDD[((Int, Int), Matrix)] where first element of the tuple contains coordinates of the block and the second one is a local org.apache.spark.mllib.linalg.Matrix

    val eye = Matrices.sparse(
      3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1, 1, 1))
    
    val blocks = sc.parallelize(Seq(
       ((0, 0), eye), ((1, 1), eye), ((2, 2), eye)))
    
    val blockMatrix = new BlockMatrix(blocks, 3, 3, 9, 9)
    
  • CoordinateMatrix - can be created from RDD[MatrixEntry] where MatrixEntry consist of row, column and value.

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix,
      MatrixEntry}
    
    val entries = sc.parallelize(Seq(
       (0, 0, 3.0), (2, 0, -5.0), (3, 2, 1.0),
       (4, 1, 6.0), (6, 2, 2.0), (8, 1, 4.0))
    ).map{case (i, j, v) => MatrixEntry(i, j, v)}
    
    val coordinateMatrix = new CoordinateMatrix(entries, 9, 3)
    

First two implementations support multiplication by a local Matrix:

val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

indexedRowMatrix.multiply(localMatrix).rows.collect
// Array(IndexedRow(0,[1.0,4.0]), IndexedRow(0,[2.0,5.0]),
//   IndexedRow(0,[3.0,6.0]))

and the third one can be multiplied by an another BlockMatrix as long as number of columns per block in this matrix matches number of rows per block of the other matrix. CoordinateMatrix doesn't support multiplications but is pretty easy to create and transform to other types of distributed matrices:

blockMatrix.multiply(coordinateMatrix.toBlockMatrix(3, 3))

Each type has its own strong and weak sides and there are some additional factors to consider when you use sparse or dense elements (Vectors or block Matrices). Multiplying by a local matrix is usually preferable since it doesn't require expensive shuffling.

You can find more details about each type in the MLlib Data Types guide.

zero323
  • 322,348
  • 103
  • 959
  • 935
  • 2
    This is a great summary for how to create the different matrix types -- I think I'm going to bookmark this! – Brian Risk Jul 08 '16 at 12:47
  • What are the `BlockMatrix` parameters used for? Also I see multiply can take an int param? I am having trouble trying to multiply 250k x 30k X 30k x 30k – Dan Ciborowski - MSFT Mar 27 '18 at 00:43
  • 1
    Here is the question though, even though you can do matrix multiplication in spark, should you? It seems like so many other libraries in the C world do matrix multiplication much better, like in deep learning libraries and such. They even leverage distributed GPU processing in some cases. How much less efficient is matrix operations in spark? – oneirois Sep 01 '18 at 12:44
  • 1
    I found this very useful. The users who closed this question are doing a disservice. – Dean Schulze Feb 06 '19 at 01:41
  • 1
    I would like to know which one scales better if, for instance, you would use it for `columnSimilarities`? – Maziyar Dec 02 '19 at 20:00