I'm trying to compute item similarities using Mahout's itemsimilarity job. The problem is that I get very few similarities in the output.

Here are my input data characteristics:

  • 15.910.847 total count of preferences
  • 4.047.745 distinct users
  • 773.015 distinct items

I've built the distribution of users and preferences.

The first column is the count of distinct users.

The second column is the count of preferences per user. For example, I have 2.221.760 users with only one preference.

2221760   1
688258    2
322497    3
192003    4
122446    5
87033     6
63733     7
49556     8
39090     9
31637     10
25634     11
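
As a quick check on the figures above, here is a small sketch (not part of the original job) that computes what share of users have a single preference. The dictionary simply restates the rows listed in the question; the variable names are made up for illustration. A user with exactly one preference cannot contribute to any item pair, so those users add nothing to item-item co-occurrence.

```python
# Figures copied from the question; only the first 11 rows were listed there.
total_users = 4_047_745
users_by_pref_count = {
    1: 2_221_760, 2: 688_258, 3: 322_497, 4: 192_003, 5: 122_446,
    6: 87_033, 7: 63_733, 8: 49_556, 9: 39_090, 10: 31_637, 11: 25_634,
}

# Users with one preference never appear in a co-rated item pair.
share_single = users_by_pref_count[1] / total_users

# Users covered by the rows that were actually listed.
listed_users = sum(users_by_pref_count.values())

print(f"single-preference users: {share_single:.1%}")
print(f"users covered by the listed rows: {listed_users}")
```

By this count, slightly more than half of all users fall in the single-preference bucket.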

Here are my input settings:

similarityClassname=SIMILARITY_PEARSON_CORRELATION
maxSimilaritiesPerItem=100000
minPrefsPerUser=0

booleanData=false
threshold=0.75
Capacytron

3 Answers

  • Column 1 should be the Mahout user ID, from 0 through (number of users - 1).
  • Column 2 should be the Mahout item ID, from 0 through (number of items - 1). You can't just count preferences; you want to record each item that the user showed some preference for.
  • Column 3 is the strength of preference, like a rating.

IDs are like row and column numbers in a matrix or table: 0,0 is user 0, item 0, and the value is the rating.

You must translate your IDs into Mahout IDs, then back into your own IDs when reading the results of itemsimilarity.
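
The translation described above can be sketched roughly as follows. This is a minimal illustration, not Mahout code: the sample IDs, `build_index`, and `translate_pair` are hypothetical names invented for the example, and the output lines assume the usual `userID,itemID,preference` CSV layout.

```python
def build_index(raw_ids):
    """Assign each distinct raw ID a contiguous index starting at 0."""
    forward = {}
    for raw in raw_ids:
        if raw not in forward:
            forward[raw] = len(forward)
    reverse = {idx: raw for raw, idx in forward.items()}
    return forward, reverse


# Hypothetical preference triples: (user_id, item_id, rating).
prefs = [
    ("u-9001", "sku-77", 2.0),
    ("u-9001", "sku-12", 1.0),
    ("u-4242", "sku-77", 1.0),
]

user_fwd, user_rev = build_index(u for u, _, _ in prefs)
item_fwd, item_rev = build_index(i for _, i, _ in prefs)

# Input lines with contiguous IDs, one preference per line.
mahout_lines = [f"{user_fwd[u]},{item_fwd[i]},{r}" for u, i, r in prefs]


def translate_pair(idx_a, idx_b):
    """Map a pair of item indexes from the output back to the original item IDs."""
    return item_rev[idx_a], item_rev[idx_b]


print(mahout_lines)
print(translate_pair(0, 1))
```

Keeping both the forward and the reverse dictionary is what makes the round trip possible once the similarity output comes back keyed by index.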

pferrel
  • Hi :) I got your response on the Mahout user group. I'm using org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and I can't find a requirement to map numerical user and item IDs to the ranges [0..qtty_of_users], [0..qtty_of_items] before feeding preference data to Mahout. The same goes for org.apache.mahout.cf.taste.hadoop.item.RecommenderJob, where I can get the itemSimilarityMatrix. – Capacytron Jul 26 '14 at 10:17
  • And yet it is true. I'll write up a wiki page describing this. We just rewrote the wiki for the 0.9 release and it was missed. – pferrel Jul 27 '14 at 01:07
  • Great, going to try and report. – Capacytron Jul 27 '14 at 11:30
  • Didn't help. I used the Apache Pig RANK function to assign an ID in 1..N to each distinct user_id and in 1..M to each distinct item_id. I fed ItemSimilarityJob a dataset with user_id in [1..N], item_id in [1..M], and preference 1.0 or 2.0. The output is pretty much the same: 16*10^6 preferences from 4*10^6 users for 7*10^5 items give only 10^3 similarities for 10 items... What am I doing wrong? Should the input dataset be ordered by item/user ID? – Capacytron Jul 27 '14 at 20:46

Didn't help. I used the Apache Pig RANK function to assign an ID in 1..N to each distinct user_id and in 1..M to each distinct item_id. I fed ItemSimilarityJob a dataset with user_id in [1..N], item_id in [1..M], and preference 1.0 or 2.0. The output is pretty much the same: 16*10^6 preferences from 4*10^6 users for 7*10^5 items give only 10^3 similarities for 10 items... What am I doing wrong? Should the input dataset be ordered by item/user ID? – Sergey Jul 27 at 20:46

It is because the Mahout Taste implementations accept ints as input for user_ids. If you supply anything that overflows the maximum int value (Integer.MAX_VALUE), it will roll over to the minimum, which means it won't get added as a unique user.

You could perhaps hash your user_id before you feed it to Mahout if it exceeds the maximum int value. Or you could keep an alphanumeric ID and use the IDMigrator class for user_id input.
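
To illustrate the rollover described above, here is a small sketch that mimics a Java-style narrowing of a 64-bit value to a 32-bit int. It is not Mahout's actual conversion code, and `to_int32` is a hypothetical helper written only for this example. It shows how two distinct large IDs can end up as the same int.

```python
INT_MAX = 2**31 - 1  # same value as Java's Integer.MAX_VALUE


def to_int32(value):
    """Mimic a Java-style narrowing cast from a 64-bit long to a 32-bit int."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2**32 if value > INT_MAX else value


# One past the maximum wraps around to the minimum int.
wrapped = to_int32(INT_MAX + 1)

# Two distinct IDs that differ by exactly 2**32 become indistinguishable.
id_a = 5
id_b = 5 + 2**32
collision = to_int32(id_a) == to_int32(id_b)

print(wrapped, collision)
```

If IDs really are being narrowed this way, distinct users would silently merge, which would fit the symptom of fewer unique users than expected.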

exergy

  • I should use SIMILARITY_COSINE instead of SIMILARITY_PEARSON_CORRELATION because I use discrete preferences.
  • Don't use threshold; it does not work the same way as it does with booleanData=true.
  • I'm not sure that I have to "remap" my natural user and item IDs to surrogate IDs in [0...N].

The problem looks solved. Thank you, guys!
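
One possible reason the switch away from Pearson helps with discrete preferences can be sketched with textbook formulas. This is an illustration under assumptions, not Mahout's distributed implementation: the two rating vectors are invented, and `pearson` and `cosine` are plain reference versions. When an item gets the same rating from every co-rating user, its centered vector is all zeros, so Pearson correlation is undefined, while cosine similarity still yields a value.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation; returns NaN when either vector has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cx = [a - mean_x for a in x]
    cy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    if denom == 0:
        return float("nan")
    return sum(a * b for a, b in zip(cx, cy)) / denom


# Hypothetical ratings from three users who rated both items (values 1.0 or 2.0).
item_a = [1.0, 1.0, 1.0]  # every user gave the same rating
item_b = [1.0, 2.0, 1.0]

print(pearson(item_a, item_b))  # undefined: item_a has no variance
print(cosine(item_a, item_b))
```

With only two possible rating values and most users holding very few preferences, constant vectors like `item_a` are likely to be common, which may explain why so few Pearson similarities survive.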

lrkwz
Capacytron