Spark matrix multiplication with Python

Question

I am trying to do matrix multiplication using Apache Spark and Python.

Here is my data

from pyspark.mllib.linalg.distributed import RowMatrix

My RDD of vectors

rows_1 = sc.parallelize([[1, 2], [4, 5], [7, 8]])
rows_2 = sc.parallelize([[1, 2], [4, 5]])

My matrices

mat1 = RowMatrix(rows_1)
mat2 = RowMatrix(rows_2)

I would like to do something like this:

mat = mat1 * mat2

I wrote a function to do the matrix multiplication, but I'm afraid it will take a long time to run. Here is my function:

def matrix_multiply(df1, df2):
    nb_row = df1.count()    
    mat=[]
    for i in range(0, nb_row):
        row=list(df1.filter(df1['index']==i).take(1)[0])
        row_out = []
        for r in range(0, len(row)):
            r_value = 0
            col = df2.select(df2[list_col[r]]).collect()
            col = [list(c)[0] for c in col]
            for c in range(0, len(col)): 
                r_value += row[c] * col[c]
            row_out.append(r_value)            
        mat.append(row_out)
    return mat

My function performs a lot of Spark actions (take, collect, etc.). Will it take a lot of processing time? If anyone has another idea, it would be helpful.

zero323 · Answer 1 · 2016-06-13T19:16:24.857

You cannot. Since RowMatrix has no meaningful row indices, it cannot be used for multiplication. Even ignoring that, the only distributed matrix that supports multiplication with another distributed structure is BlockMatrix.

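from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    # zipWithIndex yields (vector, index) pairs; IndexedRow expects (index, vector)
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

# BlockMatrix.multiply returns another BlockMatrix
as_block_matrix(rows_1).multiply(as_block_matrix(rows_2))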
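
If BlockMatrix.multiply is not available (it was introduced in Spark 1.6), one possible workaround is to multiply row by row. This is only a rough sketch and it assumes the right-hand matrix is small enough to be collected and broadcast. The helper names multiply_row and multiply_rows are hypothetical, not part of Spark, and the per-row logic is shown in plain Python so it can be checked without a cluster:

```python
def multiply_row(row, right):
    """Multiply one row vector by a small local matrix (a list of rows)."""
    if len(row) != len(right):
        raise ValueError("row length must equal the number of rows in `right`")
    # zip(*right) iterates over the columns of `right`
    return [sum(a * b for a, b in zip(row, col)) for col in zip(*right)]


def multiply_rows(rows, right):
    """Local stand-in for mapping multiply_row over every row of the left matrix."""
    return [multiply_row(row, right) for row in rows]


# Same values as rows_1 and rows_2 in the question, held as local lists
local_rows_1 = [[1, 2], [4, 5], [7, 8]]
local_rows_2 = [[1, 2], [4, 5]]

product = multiply_rows(local_rows_1, local_rows_2)
print(product)

# Assumed Spark usage (not run here): broadcast the small matrix once,
# then apply the same per-row function on the executors.
# right = sc.broadcast(rows_2.collect())
# product_rdd = rows_1.map(lambda row: multiply_row(row, right.value))
```

Each output row depends only on its own input row, so no row indices are needed and the job runs as a single map instead of one Spark action per row. It does not help when both matrices are too large to collect.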

edited Jun 13 '16 at 19:16

answered Jun 11 '16 at 17:51

Thank you for your answer. But it does not work for me. I'm using Spark 1.5.0. Here is the error message: **AttributeError: 'BlockMatrix' object has no attribute 'multiply'** – Raouf Jun 13 '16 at 11:52

It has been introduced in 1.6. – zero323 Jun 13 '16 at 14:08

Ok, I see. I created a function to do it (see the post above). – Raouf Jun 14 '16 at 12:15
I'm getting a `SparkException: Job aborted due to stage failure` on the line `rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))` – Denis G. Labrecque Apr 19 '21 at 01:38
