
Let's say I have these two Numpy arrays:

A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

and I perform the following on them:

np.sum(np.dot(A, B))

Now, I'd like to perform essentially the same calculation on the same matrices using PySpark, so that the computation is distributed across my Spark cluster.

Does anyone know or have a sample that does something along these lines in PySpark?

Thank you very much for any help!

user2926603
  • Seems relevant https://labs.yodas.com/large-scale-matrix-multiplication-with-pyspark-or-how-to-match-two-large-datasets-of-company-1be4b1b2871e#.u0khat9gy – kennytm Mar 19 '17 at 17:52
  • Perhaps, but I am unfortunately unable to apply that solution to my question. It seems to use different libraries and is a word/text based problem. – user2926603 Mar 19 '17 at 18:07
  • Well are your matrices dense or sparse? And are A and B really 1024×1024 or larger? – kennytm Mar 19 '17 at 18:13
  • Thanks for the replies, kennytm. A & B can be larger, but 1024x1024 should work for my testing. The size of the matrix really isn't my concern. Also, these are numpy arrays and I believe they can be easily converted into dense matrices, so am fine doing that, if it is needed. – user2926603 Mar 19 '17 at 18:30

1 Answer


Using the as_block_matrix method from this post, you could do the following (but see @kennytm's comment above on why this approach can be slow for larger matrices):

import numpy as np
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

matrixA = as_block_matrix(sc.parallelize(A))  # sc is your SparkContext
matrixB = as_block_matrix(sc.parallelize(B))
product = matrixA.multiply(matrixB)  # distributed matrix product
Alex