I am trying to build an item-based recommender using cosine similarity with MapReduce.

Here's the input set.

itemIdx_1, userIdx_1
itemIdx_1, userIdx_2
itemIdx_2, userIdx_1
itemIdx_3, userIdx_3
... 

How should I design the job for this input data?

To use cosine similarity, I guess the input should look like the following:

(there are no preference values, so each entry would presumably be 0 or 1)
    itemIdx_1 , [userIdx_1:1, userIdx_2:1, userIdx_3:0]
    itemIdx_2 , [userIdx_1:1, userIdx_2:0, userIdx_3:0]
    itemIdx_3 , [userIdx_1:0, userIdx_2:0, userIdx_3:1]
    ...

But how do I compare the rows with each other using MapReduce?

Please help. I've been stuck on this for about a week.

Hoon

1 Answer

Each item can be represented by a vector [user1, user2, user3, ..., userN], where each entry is that user's rating of the item. The cosine similarity of item1 and item2 is then

sum(item1*item2) / (sqrt(sum(item1*item1)) * sqrt(sum(item2*item2)))
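
As a minimal single-machine sketch of this formula (not MapReduce yet), the sparse item vectors can be held as dictionaries that only store the users who chose the item. The function name `cosine_similarity` and the dictionary layout are assumptions made for illustration; the sample data reuses the first two items from the question.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {user_id: rating}."""
    # Only users present in both vectors contribute to the dot product.
    dot = sum(rating * b[user] for user, rating in a.items() if user in b)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty vector has no direction; treat the similarity as 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Binary "ratings": 1 means the user chose the item, absent means 0.
item1 = {"userIdx_1": 1, "userIdx_2": 1}
item2 = {"userIdx_1": 1}
similarity = cosine_similarity(item1, item2)
```

Note that the whole denominator is the product of the two norms, so both square roots divide the dot product.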

You can store sqrt(sum(item1*item1)) alongside the item1 row; I attach it to each item row as D.
NOTE: the item vectors are sparse.

I do this in three steps:

  1. Calculate D = sqrt(sum(item*item)) for each item.
  2. Map by userid, so that all of one user's choices arrive at the same reduce call. E.g.

    userid  itemid  rating  (D)
    223     2344    4
    223     2324    5
    223     3444    3
    

    Then, for each user, pair up every two items to get a list of that user's item pairs. E.g.

    itemid1  itemid2  rating1  rating2  (D1)  (D2)
    2344     2324     4        5
    ........
    
  3. (very easy) Group by (itemid1, itemid2) and sum rating1*rating2 to get sum(item1*item2). Since sqrt(sum(item1*item1)) and sqrt(sum(item2*item2)) are already known as D1 and D2, dividing gives the cosine similarity.
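
The three steps can be sketched in plain Python, with each function standing in for one map/reduce pass run in memory. This is only an illustration under some assumptions: the function names are hypothetical, each (user, item) pair is assumed to appear at most once, and D is looked up from a dictionary rather than joined onto each row as it would be in a real job. The sample records are the binary input from the question.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_norms(records):
    """Step 1: D = sqrt(sum(rating^2)) per item (key by item, reduce by sum)."""
    squares = defaultdict(float)
    for _user, item, rating in records:
        squares[item] += rating * rating
    return {item: math.sqrt(total) for item, total in squares.items()}


def pair_products(records):
    """Step 2: group by user, then emit ((item1, item2), rating1 * rating2)."""
    by_user = defaultdict(list)
    for user, item, rating in records:
        by_user[user].append((item, rating))
    for ratings in by_user.values():
        # Sorting gives every pair a canonical key order, so (a, b) and
        # (b, a) end up under the same reduce key.
        for (i1, r1), (i2, r2) in combinations(sorted(ratings), 2):
            yield (i1, i2), r1 * r2


def item_similarities(records):
    """Step 3: sum the products per item pair and divide by D1 * D2."""
    norms = item_norms(records)
    dots = defaultdict(float)
    for pair, product in pair_products(records):
        dots[pair] += product
    return {
        (i1, i2): dot / (norms[i1] * norms[i2])
        for (i1, i2), dot in dots.items()
    }


# (user, item, rating) records; rating is 1 because there are no preferences.
records = [
    ("userIdx_1", "itemIdx_1", 1),
    ("userIdx_2", "itemIdx_1", 1),
    ("userIdx_1", "itemIdx_2", 1),
    ("userIdx_3", "itemIdx_3", 1),
]
similarities = item_similarities(records)
```

Item pairs that share no user never get emitted in step 2, so they are simply absent from the result and can be treated as similarity 0. This is what keeps the approach workable on sparse data: rows are never compared all-against-all.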

Nathan Tuggy
Edz