I have the following Scala Spark DataFrame df of (String, Array[Double]). Note that id is of type String (a base64 hash):
id, values
"a", [0.5, 0.6]
"b", [0.1, 0.2]
...
The dataset is quite large (45k) and I would like to compute pairwise cosine similarity using org.apache.spark.mllib.linalg.distributed.RowMatrix for performance. This works, but I cannot identify the pairwise similarities because the indexes have turned into integers (output columns i and j). How do I use IndexedRowMatrix to preserve the original indexes?
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = df.select("values")
  .rdd
  .map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(rows)
val simsEstimate = mat.columnSimilarities()
Ideally, the end result should look something like this:
id_x, id_y, similarity
"a", "b", 0.9
"b", "c", 0.8
...
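
For reference, here is a rough sketch of the direction I have in mind, not a verified solution. It assumes values really is an array<double> column (if it is an ml Vector, as in my snippet above, the extraction step would differ), and the helper name pairwiseCosine is made up for illustration. The idea is to give each row a Long index with zipWithIndex, keep an index-to-id lookup, transpose the matrix so that columnSimilarities() compares the original rows, and then join the ids back onto i and j:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, MatrixEntry}
import org.apache.spark.sql.DataFrame

// Hypothetical helper; expects columns id: String and values: Array[Double].
def pairwiseCosine(df: DataFrame): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._

  // Give every row a Long index and keep the id next to it.
  // Cached so both uses below are more likely to see the same ordering.
  val indexed = df.select("id", "values").rdd
    .map(r => (r.getString(0), r.getSeq[Double](1).toArray))
    .zipWithIndex()
    .map { case ((id, values), idx) => (idx, id, values) }
    .cache()

  // Lookup from matrix index back to the original String id.
  val idByIndex = indexed.map { case (idx, id, _) => (idx, id) }

  val matrix = new IndexedRowMatrix(
    indexed.map { case (idx, _, values) => IndexedRow(idx, Vectors.dense(values)) })

  // columnSimilarities() compares columns, so transpose first:
  // after the transpose, column k corresponds to original row k.
  val sims = matrix.toCoordinateMatrix().transpose().toRowMatrix().columnSimilarities()

  // Replace the integer indexes i and j with the original ids.
  sims.entries
    .map { case MatrixEntry(i, j, sim) => (i, (j, sim)) }
    .join(idByIndex)
    .map { case (_, ((j, sim), idX)) => (j, (idX, sim)) }
    .join(idByIndex)
    .map { case (_, ((idX, sim), idY)) => (idX, idY, sim) }
    .toDF("id_x", "id_y", "similarity")
}
```

Calling pairwiseCosine(df) should then give one row per pair in the id_x, id_y, similarity shape shown above. As far as I understand, columnSimilarities() only returns the upper triangle, so each pair appears once. I am not sure how well this scales to 45k, since the transposed matrix has one column per original row.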