
Suppose I have a Spark DataFrame as follows: 20M rows, each with two vector columns, one of dimension 15000 and the other of dimension 200.

>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- vec1: vector (nullable = true)
 |-- vec2: vector (nullable = true)

>>> df.count()
20000000
>>> df.rdd.first()[1].size
15000
>>> df.rdd.first()[2].size
200

Representing the two sets of vectors as matrices A (20M x 15000) and B (20M x 200), I want to calculate A'B, which would be a small 15000 x 200 matrix, i.e. the sum over all rows of the outer products vec1_i * vec2_i'.

I tried to brute-force it with

import numpy as np

# A'B as the sum of the per-row outer products vec1_i * vec2_i'
df.rdd \
    .map(lambda row: np.outer(row["vec1"].toArray(), row["vec2"].toArray())) \
    .reduce(lambda a, b: a + b)

This works for vectors of smaller dimension, but not at the dimensionality I'm dealing with.

I also tried a mapPartitions approach, computing the partial sum of outer products within each partition first, but that doesn't work well either.
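
Roughly what that attempt looks like (a sketch, with the 15000 x 200 dimensions hard-coded for brevity):

import numpy as np

def partial_outer(rows):
    # accumulate vec1_i * vec2_i' within one partition, emit a single partial sum
    acc = np.zeros((15000, 200))
    for row in rows:
        acc += np.outer(row["vec1"].toArray(), row["vec2"].toArray())
    yield acc

result = df.rdd.mapPartitions(partial_outer).reduce(lambda a, b: a + b)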

Is there a more efficient way to calculate A'B in this case?

Thank you!

Julius
  • You could try using distributed matrices, see https://stackoverflow.com/questions/33558755/matrix-multiplication-in-apache-spark. – Shaido Jan 10 '19 at 03:23
  • The only type that allows multiplication is BlockMatrix, and somehow it's really slow to create one from the dataframe (I tried the rdd zipWithIndex -> IndexedRowMatrix -> BlockMatrix route, sketched below). Any suggestion on this? – Julius Jan 10 '19 at 18:58
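
For reference, the BlockMatrix route mentioned in my comment, roughly as I tried it (a sketch; block sizes left at their defaults):

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# pair each row with a stable row index
indexed = df.rdd.zipWithIndex()

# build one distributed row matrix per vector column
mat_a = IndexedRowMatrix(indexed.map(lambda x: IndexedRow(x[1], x[0]["vec1"].toArray())))
mat_b = IndexedRowMatrix(indexed.map(lambda x: IndexedRow(x[1], x[0]["vec2"].toArray())))

# convert to BlockMatrix and compute A'B (a 15000 x 200 result)
product = mat_a.toBlockMatrix().transpose().multiply(mat_b.toBlockMatrix())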
