
I want to perform a large matrix multiplication C = A * B.T and then filter C by applying a stringent threshold, collecting a list of the form (row index, column index, value).

A and B are sparse, with mostly zero entries. They are initially represented as SciPy CSR matrices.

Sizes of the matrices (when they are in dense format):
A: 9G (900,000 x 1200)
B: 6.75G (700,000 x 1200)
C, before thresholding: 5000G
C, after thresholding: 0.5G

Using pyspark, what strategy would you expect to be most efficient here? Which abstraction should I use to parallelize A and B? What else should I be thinking about to optimize the partition sizes?


Should I stick with my scipy sparse matrix objects and simply parallelize them into RDDs (perhaps with some custom serialization)?
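
If I go this route, here is roughly what I have in mind (a sketch only; n_blocks, threshold and sc are placeholder names): split A into horizontal blocks of rows, broadcast the still-sparse B once, multiply each block against B.T on the executors, and keep only the entries that pass the threshold.

    import numpy as np

    def row_blocks(A, n_blocks):
        """Yield (row_offset, csr_block) pairs covering all rows of A."""
        step = int(np.ceil(A.shape[0] / n_blocks))
        for start in range(0, A.shape[0], step):
            yield start, A[start:start + step]

    def multiply_and_filter(offset, A_block, B_bcast, threshold):
        C_block = A_block.dot(B_bcast.value.T).tocoo()   # sparse block of C
        keep = np.abs(C_block.data) >= threshold
        return [(int(offset + i), int(j), float(v))
                for i, j, v in zip(C_block.row[keep], C_block.col[keep], C_block.data[keep])]

    B_bcast = sc.broadcast(B)                            # pickled scipy CSR matrix
    triples = (sc.parallelize(list(row_blocks(A, n_blocks=200)), numSlices=200)
                 .flatMap(lambda blk: multiply_and_filter(blk[0], blk[1], B_bcast, threshold=0.9))
                 .collect())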

Should I store the non-zero entries of my A and B matrices using a DataFrame, then convert them to local pyspark matrix types when they are on the executors?
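
For this route, I am imagining something like the sketch below (spark, A and the partition count are placeholders): store the non-zero triplets of A in a DataFrame, range-partition by row index, then rebuild one local scipy CSR block per partition on the executors.

    import scipy.sparse as sp

    A_coo = A.tocoo()
    a_df = spark.createDataFrame(
        [(int(i), int(j), float(v)) for i, j, v in zip(A_coo.row, A_coo.col, A_coo.data)],
        ["row", "col", "value"])

    def partition_to_csr(rows_iter, n_cols):
        """Rebuild a sparse block from the triplets that landed in this partition."""
        triples = [(r.row, r.col, r.value) for r in rows_iter]
        if not triples:
            return iter([])
        rows, cols, vals = zip(*triples)
        offset = min(rows)                               # first row held by this block
        block = sp.coo_matrix(
            (vals, ([r - offset for r in rows], cols)),
            shape=(max(rows) - offset + 1, n_cols)).tocsr()
        return iter([(offset, block)])

    blocks = (a_df.repartitionByRange(200, "row")        # Spark 2.4+ for repartitionByRange
                  .rdd
                  .mapPartitions(lambda it: partition_to_csr(it, n_cols=A.shape[1])))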

Should I use a DistributedMatrix abstraction from MLlib? For this strategy, I think I would first convert my scipy CSR matrices to COO format, then create a pyspark CoordinateMatrix, and then convert to one of the following (rough sketch after the list):

  1. BlockMatrix? Dense representation, but it allows matrix multiplication with another distributed BlockMatrix.
  2. IndexedRowMatrix? Sparse representation, but it only allows matrix multiplication with a local matrix (e.g. a broadcast SparseMatrix?).
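
For the BlockMatrix route (option 1), the rough sketch I have in mind is below; sc, A, B are assumed to exist and the block sizes are just guesses I would tune.

    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    def to_coordinate_matrix(sc, M):
        M_coo = M.tocoo()
        entries = sc.parallelize(
            [MatrixEntry(int(i), int(j), float(v))
             for i, j, v in zip(M_coo.row, M_coo.col, M_coo.data)])
        return CoordinateMatrix(entries, M.shape[0], M.shape[1])

    A_cm = to_coordinate_matrix(sc, A)
    Bt_cm = to_coordinate_matrix(sc, B).transpose()      # B.T as a CoordinateMatrix

    # multiply() needs colsPerBlock of the left to equal rowsPerBlock of the right,
    # and the product is built from dense blocks.
    C_bm = (A_cm.toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)
                .multiply(Bt_cm.toBlockMatrix(rowsPerBlock=1024, colsPerBlock=1024)))

    # Back to entries, then apply the threshold
    C_triples = (C_bm.toCoordinateMatrix().entries
                     .filter(lambda e: abs(e.value) >= 0.9)
                     .map(lambda e: (e.i, e.j, e.value)))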

*EDIT: Going through the docs, I was also happy to discover the IndexedRowMatrix method columnSimilarities(), which may be a good option when the goal is computing cosine similarity.
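
If I end up going down that road, I am picturing something like this (rows_rdd is a placeholder RDD of IndexedRow; going through RowMatrix also lets me pass a DIMSUM threshold so only similarities likely to exceed it are estimated):

    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    mat = IndexedRowMatrix(rows_rdd)
    sims = mat.toRowMatrix().columnSimilarities(threshold=0.1)   # CoordinateMatrix
    high = sims.entries.filter(lambda e: e.value >= 0.9)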


I am looking for a local solution for now. I have two machines available for prototyping: one with 16G RAM and 10 CPUs, the other with 64G RAM and 28 CPUs. I plan to run this on a cluster once I have a good prototype.
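
For reference, this is the kind of local setup I am prototyping with; all values are placeholders I would tune per machine (this one aimed at the 64G / 28-CPU box).

    from pyspark.sql import SparkSession

    # spark.driver.memory only takes effect if set before the JVM starts
    spark = (SparkSession.builder
             .master("local[28]")
             .config("spark.driver.memory", "48g")
             .config("spark.default.parallelism", "112")     # a few partitions per core
             .config("spark.sql.shuffle.partitions", "112")
             .appName("sparse-matmul-prototype")
             .getOrCreate())
    sc = spark.sparkContext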

