I am trying to implement cosine similarity to calculate item-item similarity on an input dataset that looks like this -
UserID, ProductID, Transactions
where UserID and ProductID are Long values and Transactions is an Integer.
I am following this example in Spark - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
The example above expects dense Vectors as input, which get converted to a RowMatrix.
Could you please help me convert my input data set -
U1,P1,T1
U1,P3,T2
U2,P1,T4
U2,P2,T5
U3,P1,T6
U3,P3,T7
to a matrix of the form -
   | P1 | P2 | P3
u1 | T1 |    | T2
u2 | T4 | T5 |
u3 | T6 |    | T7
I am aware that I can create a CoordinateMatrix like this -
val mat = new CoordinateMatrix(transactions.map(entry => MatrixEntry(entry.user, entry.product, entry.txns)))
However, this uses the actual user and product ID values as the matrix indices, and it fails as soon as those values exceed the Integer range.
I need a way to convert my data to a matrix indexed by contiguous row and column positions instead of the raw IDs.
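To make the goal concrete, here is a minimal sketch of the index mapping I have in mind, written with plain Scala collections rather than Spark. The `Txn` case class, the `IndexMapping` object, and the sample values are all made up for illustration. On RDDs, I assume `distinct()` followed by `zipWithIndex()` and a join back onto the data would play the same role, but I have not verified that at scale.

```scala
// Hypothetical record type mirroring the input columns (names are assumptions).
final case class Txn(user: Long, product: Long, txns: Int)

object IndexMapping {
  // Assign each distinct ID a contiguous 0-based index, in sorted order.
  def indexOf(ids: Seq[Long]): Map[Long, Int] =
    ids.distinct.sorted.zipWithIndex.toMap

  // Re-key every transaction as (rowIndex, columnIndex, value).
  def toEntries(data: Seq[Txn]): Seq[(Int, Int, Double)] = {
    val userIdx = indexOf(data.map(_.user))
    val productIdx = indexOf(data.map(_.product))
    data.map(t => (userIdx(t.user), productIdx(t.product), t.txns.toDouble))
  }

  // Made-up sample with IDs well beyond the Int range.
  val sample: Seq[Txn] = Seq(
    Txn(10000000001L, 500000000001L, 1),
    Txn(10000000001L, 500000000003L, 2),
    Txn(10000000002L, 500000000001L, 4),
    Txn(10000000002L, 500000000002L, 5),
    Txn(10000000003L, 500000000001L, 6),
    Txn(10000000003L, 500000000003L, 7)
  )
}
```

Each resulting triple uses small row and column indices, so it could be passed to `MatrixEntry` in place of the raw IDs. The two maps would need to be kept around to translate similarity results back to the original product IDs.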