I have a dataframe with two columns where each row has a Sparse Vector. I try to find a proper way to calculate the cosine similarity (or just the dot product) of the two vectors in each row.
However, I haven't been able to find any library or tutorial to do it for Sparse vectors.
The only way I found is the following:
Create a k X n matrix, where n items are described as k-dimensioned vectors. For representing each item as a k dimension vector, you can use ALS which represents each entity in a latent factor space. The dimension of this space (k) can be chosen by you. This k X n matrix can be represented as RDD[Vector].
Convert this k X n matrix to RowMatrix.
Use columnSimilarities() function to get a n X n matrix of similarities between n items.
I feel it is an overkill to calculate all the cosine similarities for each pair while I need it only for the specific pairs in my (quite big) dataframe.