
Let's say I have these two Numpy arrays:

A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

and I perform the following on them:

np.sum(np.dot(A, B))

Now, I'd like to perform essentially the same calculation on the same matrices using PySpark, so that the computation is distributed across my Spark cluster.

Does anyone know or have a sample that does something along these lines in PySpark?

Thank you very much for any help!

user2926603
  • Seems relevant https://labs.yodas.com/large-scale-matrix-multiplication-with-pyspark-or-how-to-match-two-large-datasets-of-company-1be4b1b2871e#.u0khat9gy – kennytm Mar 19 '17 at 17:52
  • Perhaps, but I am unfortunately unable to apply that solution to my question. It seems to use different libraries and is a word/text based problem. – user2926603 Mar 19 '17 at 18:07
  • Well are your matrices dense or sparse? And are A and B really 1024×1024 or larger? – kennytm Mar 19 '17 at 18:13
  • Thanks for the replies, kennytm. A & B can be larger, but 1024x1024 should work for my testing. The size of the matrix really isn't my concern. Also, these are numpy arrays and I believe they can be easily converted into dense matrices, so am fine doing that, if it is needed. – user2926603 Mar 19 '17 at 18:30

1 Answer


Using the as_block_matrix method from this post, you could do the following (but see @kennytm's comment above on why this approach can be slow for larger matrices):

import numpy as np
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

matrixA = as_block_matrix(sc.parallelize(A))  # sc is your SparkContext
matrixB = as_block_matrix(sc.parallelize(B))
product = matrixA.multiply(matrixB)  # distributed matrix product
Alex