
Hi, I'm wondering how to transpose a RowMatrix in PySpark.

from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.linalg.distributed import RowMatrix

data = [(MLLibVectors.dense([1.0, 2.0]),), (MLLibVectors.dense([3.0, 4.0]),)]

df = sqlContext.createDataFrame(data, ["features"])
features = df.select("features").rdd.map(lambda row: row[0])

mat = RowMatrix(features)
print(mat.rows.first())
# [1.0,2.0]

# What I would like to do (RowMatrix has no such method):
mat = mat.transpose()

print(mat.rows.first())
# [1.0,3.0]

Has anyone implemented this in Python? I've seen similar posts, but everything is in Scala. Thanks.

Patrick Ruff

1 Answer


RowMatrix doesn't have a transpose method, so you need to convert it to a matrix type that does, such as a BlockMatrix or a CoordinateMatrix.


from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Expand each row i of the RowMatrix into (i, j, value) coordinate entries.
cm = CoordinateMatrix(
    mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)

cm.toRowMatrix().rows.first().toArray()
# array([ 1.,  2.])

# CoordinateMatrix does have a transpose() method.
cm.transpose().toRowMatrix().rows.first().toArray()
# array([ 1.,  3.])
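To see why this works without spinning up Spark, the same entry expansion and transpose can be sketched in plain Python, with local lists standing in for the RDD of `MatrixEntry` objects:

```python
# Local sketch of the flatMap logic above (pure Python, no Spark required).
rows = [[1.0, 2.0], [3.0, 4.0]]

# Expand each row i into (i, j, value) coordinate entries.
entries = [(i, j, v) for i, row in enumerate(rows) for j, v in enumerate(row)]

# Transposing a coordinate matrix just swaps the row and column indices.
transposed = [(j, i, v) for i, j, v in entries]

# Rebuild dense rows from the transposed entries.
n_rows = max(i for i, _, _ in transposed) + 1
n_cols = max(j for _, j, _ in transposed) + 1
out = [[0.0] * n_cols for _ in range(n_rows)]
for i, j, v in transposed:
    out[i][j] = v

print(out[0])  # first row of the transpose: [1.0, 3.0]
```

The key point is that a coordinate representation makes transposition trivial: no data moves, only the two indices of each entry are swapped.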
Psidom
  • Interesting, thanks for the help. I will go with this, and if I find a different way I'll post it (I've tried to convert the Scala code at https://stackoverflow.com/questions/30556478/matrix-transpose-on-rowmatrix-in-spark to Python, but with no luck so far). – Patrick Ruff Nov 06 '17 at 14:03
  • The code works perfectly fine, but I found that this operation takes place on only a single core. Is there a way this can be parallelized to run on all the nodes in the cluster? – Nikhil Baby Dec 22 '17 at 04:18