
I have the following Scala Spark DataFrame df of (String, Array[Double]). Note that id is of type String (a base64 hash):

id, values
"a", [0.5, 0.6]
"b", [0.1, 0.2]
...

The dataset is quite large (45k rows), so I would like to compute pairwise cosine similarities using org.apache.spark.mllib.linalg.distributed.RowMatrix for performance. This works, but I am not able to identify which pair each similarity belongs to, because the indexes have been turned into integers (output columns i and j). How do I use IndexedRowMatrix to preserve the original indexes?

val rows = df.select("values")
  .rdd
  .map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat = new RowMatrix(rows)

val simsEstimate = mat.columnSimilarities()

Ideally, the end result should look something like this:

id_x, id_y, similarity
"a", "b", 0.9
"b", "c", 0.8
...

1 Answer


columnSimilarities() computes similarities between the columns of the RowMatrix, not between its rows, so the ids you have are meaningless in this context; the indices i and j are positions within each feature vector.
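To see this concretely, you can inspect the entries of the CoordinateMatrix returned by columnSimilarities() (a sketch, reusing simsEstimate from your snippet):

import org.apache.spark.mllib.linalg.distributed.MatrixEntry

// Each MatrixEntry(i, j, value) pairs two *column* positions. With the
// 2-element vectors from the question, the only upper-triangular entry is
// i = 0, j = 1, no matter how many rows (ids) there are.
simsEstimate.entries.collect().foreach {
  case MatrixEntry(i, j, sim) => println(s"columns $i and $j: $sim")
}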

Additionally, these methods are designed for long and narrow data, so the obvious approach (encode id with StringIndexer, create an IndexedRowMatrix, transpose, compute similarities, and map back with IndexToString) simply won't do.

Your best bet here is to use crossJoin:

df.as("a").crossJoin(df.as("b")).where($"a.id" <= $"b.id").select(
  $"a.id" as "id_x", $"b.id" as "id_y", cosine_similarity($"a.values", $b.values")
)

where

val cosine_similarity = udf((xs: Seq[Double], ys: Seq[Double]) => ???)

is something you have to implement yourself.
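For instance, a minimal sketch of such a UDF, assuming equal-length, non-empty arrays (note that Spark hands ArrayType columns to Scala UDFs as Seq[Double], hence the signature):

import org.apache.spark.sql.functions.udf

// Sketch only: plain cosine similarity of two equal-length arrays.
val cosine_similarity = udf((xs: Seq[Double], ys: Seq[Double]) => {
  val dot   = xs.zip(ys).map { case (x, y) => x * y }.sum  // x . y
  val normX = math.sqrt(xs.map(x => x * x).sum)            // ||x||
  val normY = math.sqrt(ys.map(y => y * y).sum)            // ||y||
  if (normX == 0.0 || normY == 0.0) 0.0 else dot / (normX * normY)
})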

Alternatively you can explode the data:

import org.apache.spark.sql.functions.posexplode

val long = df.select($"id", posexplode($"values")).toDF("item", "feature", "value")

and then use the method shown in Spark Scala - How to group dataframe rows and apply complex function to the groups? to compute the similarity.
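If you'd rather stay in plain DataFrame operations, here is one sketch of finishing from the long format with a self-join on the feature position instead of the grouping trick from the linked answer (column names follow the toDF call above):

import org.apache.spark.sql.functions.{sqrt, sum}

// Per-id vector norms: ||x|| = sqrt(sum of squared values).
val norms = long.groupBy($"item").agg(sqrt(sum($"value" * $"value")) as "norm")

// Dot products via a self-join on the shared feature position.
val xs = long.toDF("id_x", "feature", "value_x")
val ys = long.toDF("id_y", "feature", "value_y")

val dots = xs.join(ys, "feature")
  .where($"id_x" <= $"id_y")
  .groupBy($"id_x", $"id_y")
  .agg(sum($"value_x" * $"value_y") as "dot")

// Normalize the dot products to cosine similarities.
val similarities = dots
  .join(norms.toDF("id_x", "norm_x"), "id_x")
  .join(norms.toDF("id_y", "norm_y"), "id_y")
  .select($"id_x", $"id_y", ($"dot" / ($"norm_x" * $"norm_y")) as "similarity")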