How to get the top k similar items given item vectors in a Spark DataFrame?

Question

I have a Spark DataFrame like the one below, where result is the vector for each id:

+--------------------+--------------------+ 
| id                 |              result| 
+--------------------+--------------------+ 
|000ab862128e11eab...|[-0.46, 0.31, 0.2]  | 
|0026f306128e11eab...|[-0.46, 0.31, 0.2]  | 
|00313b10d11b11ea9...|[-0.25, 0.70, 0.36] | 
|00337629128e11eab...|[-0.46, 0.31, 0.51] | 
|005492e4128e11eab...|[0.55, 0.66, 0.85]  | 
+--------------------+--------------------+

How can I get the top 5 most similar items efficiently? I have defined a cosineSimility function, which takes two vectors from the "result" column as parameters.

The word2vec similarity function only finds the similarity between words. That would not solve my problem, because every "id" consists of many words. I want to find the similarity between "id"s. – Matty, Jan 27 '21 at 06:45

score 0 · Answer 1 · answered Jan 26 '21 at 15:30


You can use withColumn to call the cosineSimility function and store its result as a new column. Then sort the DataFrame by this new column in descending order and take the first n rows with limit(n).

<dataframe>.
withColumn("rank", cosineSimility(col("result"))). // pass every argument the function needs
sort(col("rank").desc).
limit(5)


Salim


The cosineSimility function takes two vectors from the "result" column as parameters, so the usage in your answer is not correct. – Matty Jan 27 '21 at 06:49
This is an example of how to call a function and sort. I thought that was your question. You need to pass all the parameters needed. – Salim Jan 27 '21 at 18:36
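
For the pairwise case raised in the comments (a two-argument similarity function, and the top k neighbours wanted for every id), one possible approach is to cross join the DataFrame with itself, score each pair, and rank within each id using a window. The following is only a minimal sketch under several assumptions: "result" is taken to be an array<double> column (an ML Vector column would need a different UDF signature), the cosineSimility body here is a hypothetical stand-in for the asker's own function, and the ids "id1" to "id5" and the value of k are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, udf}

object TopKSimilar {
  // Hypothetical stand-in for the asker's cosineSimility function.
  // Returns 0.0 when either vector has zero length.
  def cosineSimility(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-k-similar")
      .getOrCreate()
    import spark.implicits._

    // Sample vectors from the question; the ids are placeholders.
    val df = Seq(
      ("id1", Seq(-0.46, 0.31, 0.2)),
      ("id2", Seq(-0.46, 0.31, 0.2)),
      ("id3", Seq(-0.25, 0.70, 0.36)),
      ("id4", Seq(-0.46, 0.31, 0.51)),
      ("id5", Seq(0.55, 0.66, 0.85))
    ).toDF("id", "result")

    // Wrap the two-argument function so it can be applied to two columns.
    val cosineUdf = udf((a: Seq[Double], b: Seq[Double]) => cosineSimility(a, b))

    val k = 2 // the question asks for 5; 2 keeps the sample output small

    // Rename columns on each side so the self join is unambiguous.
    val left = df.select(col("id"), col("result").as("vec"))
    val right = df.select(col("id").as("other_id"), col("result").as("other_vec"))

    // Score every ordered pair of distinct ids.
    val pairs = left.crossJoin(right)
      .where(col("id") =!= col("other_id"))
      .withColumn("similarity", cosineUdf(col("vec"), col("other_vec")))

    // Rank the candidates of each id by similarity and keep the best k.
    val byScore = Window.partitionBy("id").orderBy(col("similarity").desc)
    val topK = pairs
      .withColumn("rank", row_number().over(byScore))
      .where(col("rank") <= k)
      .select("id", "other_id", "similarity", "rank")

    topK.show(false)
    spark.stop()
  }
}
```

Note that the cross join compares every pair of rows, so the cost grows quadratically with the number of ids. That may be acceptable for small tables; for large ones, an approximate similarity join (for example Spark ML's BucketedRandomProjectionLSH, which works on Euclidean distance and so would need normalized vectors to track cosine similarity) might be worth evaluating instead.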
