How can I multiply two distributed MLlib matrices in Scala and get the result back, on a standalone Spark cluster with 9 worker machines and 1 driver machine? There are 27 workers in total, i.e. 3 workers per worker machine, each with two cores. The multiplication should be done between corresponding partitions, i.e. the 1st partition of matrix A with the 1st partition of matrix B, and so on. I am planning for 27 partitions.
The product of the matrices should be received partition-wise, and I also need to maintain an equal number of records in each partition. Matrix A is the smaller one, but matrix B is too large to fit into the memory of a single machine. The goal is to apply further transformations to the partition-wise product of matrix A and matrix B.
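To make the intent concrete, here is a minimal sketch of the partition-wise product I am after (pairwiseMultiply, rddA and rddB are hypothetical names of mine), assuming both RDDs hold one local block per partition and are co-partitioned:
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import org.apache.spark.rdd.RDD
//hypothetical sketch: partition i of rddA is multiplied with partition i of rddB;
//this only works if both RDDs have the same number of partitions and the same
//number of matrices in each corresponding partition
def pairwiseMultiply(rddA: RDD[Matrix], rddB: RDD[DenseMatrix]): RDD[DenseMatrix] =
  rddA.zipPartitions(rddB) { (aIter, bIter) =>
    aIter.zip(bIter).map { case (a, b) => a.multiply(b) }
  }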
Let me clarify with the following code, which creates two block matrices.
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
//creation of blocks as local matrices which are components of the first block matrix
val eye1 = Matrices.dense(3, 2, Array(1, 2, 3, 4, 5, 6))
val eye2 = Matrices.dense(3, 2, Array(4, 5, 6, 7, 8, 9))
val eye3 = Matrices.dense(3, 2, Array(7, 8, 9, 1, 2, 3))
val eye4 = Matrices.dense(3, 2, Array(4, 5, 6, 1, 2, 3))
val blocks = sc.parallelize(Seq(
  ((0, 0), eye1), ((1, 1), eye2), ((2, 2), eye3), ((3, 3), eye4)), 4)
//block matrix created with 3 rows per block and 2 columns per block
val blockMatrix = new BlockMatrix(blocks, 3, 2)
//creation of blocks as local matrices which are components of the second block matrix
val eye5 = Matrices.dense(2, 4, Array(1, 2, 3, 4, 5, 6, 7, 8))
val eye6 = Matrices.dense(2, 4, Array(2, 4, 6, 8, 10, 12, 14, 16))
val eye7 = Matrices.dense(2, 4, Array(3, 6, 9, 12, 15, 18, 21, 24))
val eye8 = Matrices.dense(2, 4, Array(4, 8, 12, 16, 20, 24, 28, 32))
val blocks1 = sc.parallelize(Seq(
  ((0, 0), eye5), ((1, 1), eye6), ((2, 2), eye7), ((3, 3), eye8)), 4)
//block matrix created with 2 rows per block and 4 columns per block
val blockMatrix1 = new BlockMatrix(blocks1, 2, 4)
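Before multiplying, a consistency check may help; if I am not mistaken, BlockMatrix.validate() throws an exception when block sizes or dimensions are inconsistent:
//sanity-check both block matrices before the multiply; validate() fails fast
//on inconsistent block dimensions
blockMatrix.validate()
blockMatrix1.validate()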
//The following line multiplies the block matrices
val blockProduct = blockMatrix.multiply(blockMatrix1)
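As far as I know, multiply repartitions its result with a GridPartitioner of its own, so the product's blocks need not keep the 4-partition layout of the inputs. A small diagnostic (my own addition, not part of the pipeline) makes the layout visible:
//inspect the product's partitioning: number of partitions and blocks per partition
println(blockProduct.blocks.getNumPartitions)
blockProduct.blocks.glom().map(_.length).collect().foreach(println)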
//extract the (rowIndex, colIndex) pairs of the blocks into an RDD
val blockMatrixIndex = blockProduct.blocks.map{
  case((a,b),m) => (a,b)}
val (blockRowIndexMaxValue, blockColIndexMaxValue) = blockMatrixIndex.max()
//extract the data blocks (local matrices) of the block matrix into an RDD
val blockMatrixRDD = blockProduct.blocks.map{
  case((a,b),m) => m}
//double every element of each block
val blockMatrixRDDElementDoubled = blockMatrixRDD.map(x => x.toArray.map(y => 2 * y))
//find the number of rows and columns of an individual block in the block matrix
val blockMatRowCount = blockMatrixRDD.map(x => x.numRows).first
val blockMatColCount = blockMatrixRDD.map(x => x.numCols).first
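Taking first here assumes that every block has the same shape; a guard I could add (my own addition) to check that assumption:
//guard: all blocks must share one shape, otherwise Matrices.dense below
//gets wrong dimensions for some blocks
require(blockMatrixRDD.map(m => (m.numRows, m.numCols)).distinct().count() == 1)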
//rebuild each data block as a local dense matrix (toArray yields column-major values, which is what Matrices.dense expects)
val blockMatrixBlockRecreated = blockMatrixRDDElementDoubled.map(x => Matrices.dense(blockMatRowCount, blockMatColCount, x))
//generate the (i, i) index sequence for the blocks of the block matrix
val indexRange = List.range(0, blockRowIndexMaxValue + 1)
val indexSeq = indexRange zip indexRange
//parallelize the index sequence into blockRowIndexMaxValue + 1 (here 4) partitions
val indexSeqRDD = sc.parallelize(indexSeq, blockRowIndexMaxValue + 1)
//attempt to regenerate the block matrix in RDD form by zipping indices with blocks
val completeBlockMatrixRecreated = indexSeqRDD.zip(blockMatrixBlockRecreated)
completeBlockMatrixRecreated is of type org.apache.spark.rdd.RDD[((Int, Int), org.apache.spark.mllib.linalg.Matrix)], so it should contain 4 blocks.
However, if I try to execute
completeBlockMatrixRecreated.take(2)
it fails with the error "org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition".
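From what I understand, the zip fails because indexSeqRDD has exactly one element in each of its 4 partitions, while blockMatrixBlockRecreated keeps whatever layout multiply's GridPartitioner produced. As a sketch of a possible workaround (my own idea, completeBlockMatrixAlt is a hypothetical name), I could keep the (row, column) key attached to each block instead of rebuilding it with zip:
//zip-free alternative: carry the block index through the transformation,
//so no alignment between two separately partitioned RDDs is needed
val completeBlockMatrixAlt = blockProduct.blocks.map { case ((i, j), m) =>
  ((i, j), Matrices.dense(m.numRows, m.numCols, m.toArray.map(_ * 2.0)))
}
But I am not sure whether this gives me the equal number of records per partition that I need, hence the question above.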