I am trying to implement cosine similarity to calculate item-item similarity, using an input dataset that looks like this -

UserID, ProductID, Transactions

where UserID and ProductID are Long values and Transactions is an Integer.

I am following this example in Spark - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

The example above expects dense Vectors as input, which get converted to a RowMatrix.

Could you please help me convert my input data set -

U1,P1,T1
U1,P3,T2
U2,P1,T4
U3,P1,T6
U3,P3,T7

to a Matrix of form -

   |P1|P2|P3
u1 |T1|  |T2
u2 |T4|T5|
u3 |T6|  |T7

I am aware that I can create a CoordinateMatrix like this -

val mat = new CoordinateMatrix(transactions.map(entry => MatrixEntry(entry.user, entry.product, entry.txns)))

But this uses the actual user and product ID values as the matrix indices, and it fails as soon as those values exceed Int.MaxValue.

I need a way to convert my data to an indexed matrix, where each user and product ID is mapped to a contiguous index.
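To make the goal concrete, here is a minimal sketch of the kind of re-indexing I have in mind. It is not taken from the linked example. It assumes a hypothetical case class `Txn(user, product, txns)` for the input records and at most one record per (user, product) pair; the object and method names are made up for illustration. Each distinct ID gets a 0-based index via `distinct().zipWithIndex()`, and the indices are joined back onto the records before building the `CoordinateMatrix`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Hypothetical record type for (UserID, ProductID, Transactions).
case class Txn(user: Long, product: Long, txns: Int)

object ItemSimilaritySketch {

  // Assign every distinct ID a contiguous 0-based index: RDD[(id, index)].
  def indexIds(ids: RDD[Long]): RDD[(Long, Long)] =
    ids.distinct().zipWithIndex()

  // Returns the re-indexed matrix plus the (productId, columnIndex) mapping,
  // which is needed later to translate column indices back to product IDs.
  // Assumes one record per (user, product); otherwise aggregate first.
  def toIndexedMatrix(txns: RDD[Txn]): (CoordinateMatrix, RDD[(Long, Long)]) = {
    val userIndex = indexIds(txns.map(_.user))
    val productIndex = indexIds(txns.map(_.product))

    val entries = txns
      .map(t => (t.user, (t.product, t.txns.toDouble)))
      .join(userIndex)       // (user, ((product, value), row))
      .map { case (_, ((product, value), row)) => (product, (row, value)) }
      .join(productIndex)    // (product, ((row, value), col))
      .map { case (_, ((row, value), col)) => MatrixEntry(row, col, value) }

    (new CoordinateMatrix(entries), productIndex)
  }
}
```

With this, `mat.toRowMatrix().columnSimilarities()` should work as long as the number of distinct products (not the largest product ID) fits in an Int. The entries of the result refer to column indices, so they would need to be joined with the swapped product mapping to get product IDs back. If the number of distinct products is small, collecting the mapping and broadcasting it might be cheaper than the second join.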

saurzcode
  • `MatrixEntry` takes `Long` not `Integers` so range shouldn't be an issue. – zero323 Apr 18 '17 at 15:07
  • Thanks for the reply. Yes, that's true, but when I try to convert it to a RowMatrix, it fails here in the toIndexedRowMatrix method of CoordinateMatrix.scala: `if (nl > Int.MaxValue) { sys.error(s"Cannot convert to a row-oriented format because the number of columns $nl is " + "too large.") }` – saurzcode Apr 18 '17 at 15:59

0 Answers