I have a Spark DataFrame like the one below, where result is the vector for each id:

+--------------------+--------------------+ 
| id                 |              result| 
+--------------------+--------------------+ 
|000ab862128e11eab...|[-0.46, 0.31, 0.2]  | 
|0026f306128e11eab...|[-0.46, 0.31, 0.2]  | 
|00313b10d11b11ea9...|[-0.25, 0.70, 0.36] | 
|00337629128e11eab...|[-0.46, 0.31, 0.51] | 
|005492e4128e11eab...|[0.55, 0.66, 0.85]  | 
+--------------------+--------------------+

How can I efficiently get the top 5 most similar items for each id? I have defined a cosineSimility function that takes two vectors from the "result" column as parameters.

Matty

1 Answer

You can use withColumn to call the 'cosineSimility' function and store its result in a new column. Then sort the DataFrame by this new column in descending order and keep the top n rows.

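<dataframe>.
// cosineSimility needs both vectors; only one argument is shown here
withColumn("rank", cosineSimility(col("result"))).
sort(col("rank").desc).
// a DataFrame has no top(n) method; limit(5) keeps the first five sorted rows
limit(5)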
Salim
  • The cosineSimility function takes two vectors from the "result" column as parameters, so the usage in your answer is not correct. – Matty Jan 27 '21 at 06:49
  • This is an example of how to call a function and sort. I thought that was your question. You need to pass all the parameters the function needs. – Salim Jan 27 '21 at 18:36
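
Picking up the point from the comments that cosineSimility takes two vectors: one possible way to feed it both is to pair every row with every other row through a self cross join, score each pair, and keep the five best matches per id with a window function. The following is only a sketch under several assumptions: "result" is taken to be an array<double> column (not an ML Vector), the cosineSimility body shown is a stand-in for the asker's own function, and the object name, sample ids, and the other_id, similarity, and rank column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, udf}

object TopSimilar {
  // Stand-in for the asker's cosineSimility (assumption):
  // expects two equal-length, non-zero vectors.
  def cosineSimility(a: Seq[Double], b: Seq[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-similar")
      .getOrCreate()
    import spark.implicits._

    // Sample rows modelled on the question; the ids are shortened placeholders.
    val df = Seq(
      ("id1", Seq(-0.46, 0.31, 0.2)),
      ("id2", Seq(-0.46, 0.31, 0.2)),
      ("id3", Seq(-0.25, 0.70, 0.36)),
      ("id4", Seq(-0.46, 0.31, 0.51)),
      ("id5", Seq(0.55, 0.66, 0.85))
    ).toDF("id", "result")

    // Wrap the two-argument function so it can be applied to two columns.
    val cosineUdf = udf(cosineSimility _)

    // Pair each row with every other row and score the pair.
    val pairs = df.as("l")
      .crossJoin(df.as("r"))
      .where(col("l.id") =!= col("r.id"))
      .select(
        col("l.id").as("id"),
        col("r.id").as("other_id"),
        cosineUdf(col("l.result"), col("r.result")).as("similarity"))

    // Rank the candidates of each id by similarity and keep the best five.
    val byId = Window.partitionBy("id").orderBy(col("similarity").desc)
    val top5 = pairs
      .withColumn("rank", row_number().over(byId))
      .where(col("rank") <= 5)

    top5.show(false)
    spark.stop()
  }
}
```

The cross join compares every pair of rows, so the cost grows quadratically with the number of ids. For a large table, an approximate approach such as Spark ML's BucketedRandomProjectionLSH applied to normalized vectors may be worth trying instead, at the price of inexact results.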