Suppose I have a Spark DataFrame with 20M rows and two vector columns: one of dimension 15000 and the other of dimension 200.
>>> df.printSchema()
root
|-- id: string (nullable = true)
|-- vec1: vector (nullable = true)
|-- vec2: vector (nullable = true)
>>> df.count()
20000000
>>> df.rdd.first()[1].size
15000
>>> df.rdd.first()[2].size
200
Representing the two sets of vectors as matrices A (20M × 15000) and B (20M × 200), I want to calculate A'B, which would be a relatively small 15000 × 200 matrix.
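In other words, writing a_i and b_i for the vec1 and vec2 of row i, I'm after

A'B = \sum_i a_i b_i'

i.e. the sum of the per-row outer products.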
I tried to brute-force it with
import numpy as np

df.rdd \
    .map(lambda row: np.outer(row["vec1"].toArray(), row["vec2"].toArray())) \
    .reduce(lambda a, b: a + b)
This works for lower-dimensional vectors, but fails at the dimensionality I'm dealing with, presumably because each per-row outer product is already a dense 15000 × 200 array (~24 MB as float64).
I also tried a mapPartitions approach, computing the partial product within each partition first, but that doesn't seem to work well either.
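Roughly, the idea was something like this (a sketch of what I tried; partial_product is just an illustrative helper name):

import numpy as np

def partial_product(rows):
    # accumulate the 15000 x 200 partial sum within one partition
    acc = None
    for row in rows:
        outer = np.outer(row["vec1"].toArray(), row["vec2"].toArray())
        acc = outer if acc is None else acc + outer
    if acc is not None:
        yield acc

df.rdd \
    .mapPartitions(partial_product) \
    .reduce(lambda a, b: a + b)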
Is there a more efficient way to calculate A'B in this case?
Thank you!