I read that I could use the columnSimilarities method that comes with RowMatrix to find the cosine similarity of various records (content-based). My data looks something like this:
genre,actor
horror,mohanlal shobhana pranav
comedy,mammooty suraj dulquer
romance,fahad dileep manju
comedy,prithviraj
Now, I have created a spark-ml pipeline that calculates the tf-idf of the above text features (genre, actor) and uses the VectorAssembler to assemble both features into a single column "features". After that, I convert the resulting DataFrame using this:
val vectorRdd = finalDF.map(row => row.getAs[Vector]("features"))
to convert it into an RDD[Vector]
Then I obtain my RowMatrix by:
val matrix = new RowMatrix(vectorRdd)
I am following this guide as a reference for cosine similarity. What I need is a method in spark-mllib that finds the similarity between one particular record and all the others, like this sklearn call shown in the guide:
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
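To make the intent concrete, here is a minimal plain-Scala sketch (no Spark) of the computation I am after: the cosine similarity of one row against every row. The names CosineSketch, cosine and similarToRow are my own, made up for illustration, and the vectors in the usage note are toy values, not my real tf-idf output.

```scala
// Plain-Scala sketch (no Spark); all names here are illustrative only.
object CosineSketch {
  // Cosine similarity of two dense vectors of equal length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    // Assumption: treat similarity with an all-zero vector as 0.0
    // so that we never divide by zero.
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Similarity of the row at queryIndex against every row (itself included),
  // mirroring cosine_similarity(tfidf_matrix[0:1], tfidf_matrix) for queryIndex = 0.
  def similarToRow(rows: Seq[Array[Double]], queryIndex: Int): Seq[Double] = {
    val query = rows(queryIndex)
    rows.map(row => cosine(query, row))
  }
}
```

For example, CosineSketch.similarToRow(Seq(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0)), 0) compares the first toy row with all three rows. What I want is the distributed equivalent of this, computed over the rows of my RowMatrix (or the underlying RDD[Vector]).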
However, I cannot find how to do this. I don't understand what matrix.columnSimilarities() compares and returns. Can someone help me with what I am looking for?
Any help is appreciated! Thanks.