Can't get mahout itemsimilarity result with preferences (booleanValue=false)

Question

I'm trying to compute item similarities with Mahout's itemsimilarity job. The problem is that I get very few similarities in the output.

Here are my input data characteristics:

15.910.847 total count of preferences
4.047.745 distinct users
773.015 distinct items

I've built the distribution of users and preferences.

The first column is the count of distinct users.

The second column is the number of preferences per user. For example, 2.221.760 users have only one preference.

2221760   1
688258    2
322497    3
192003    4
122446    5
87033     6
63733     7
49556     8
39090     9
31637     10
25634     11

Here are my input settings:

similarityClassname=SIMILARITY_PEARSON_CORRELATION
maxSimilaritiesPerItem=100000
minPrefsPerUser=0

booleanData=false
threshold=0.75

score 0 · Answer 1 · answered Jul 26 '14 at 01:13


The input needs three columns:
- Column 1 should be the Mahout user ID, from 0 through the number of users - 1.
- Column 2 should be the Mahout item ID, from 0 through the number of items - 1.
- Column 3 is the strength of preference, like a rating.

You can't just count preferences; you want to record each item that the user showed some preference for.

IDs are like row and column numbers in a matrix or table: 0,0 is user 0, item 0, and the value is the rating.

You must translate your IDs into Mahout IDs, then back again into your own IDs when reading the results of itemsimilarity.
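If you do decide to remap IDs as described above, the translation in both directions can be sketched roughly as follows. This is only an illustration: `IdIndex` is a hypothetical helper, not a Mahout class, and the sample IDs and ratings are made up. It assigns indexes 0, 1, 2, ... in order of first appearance and keeps the reverse mapping so results can be translated back.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps arbitrary IDs to dense indexes and back.
class IdIndex {
    private final Map<String, Integer> toIndex = new HashMap<>();
    private final List<String> toId = new ArrayList<>();

    // Returns the existing index for an ID, or assigns the next free one (0, 1, 2, ...).
    int indexOf(String id) {
        Integer existing = toIndex.get(id);
        if (existing != null) {
            return existing;
        }
        int next = toId.size();
        toIndex.put(id, next);
        toId.add(id);
        return next;
    }

    // Translates an index read from the job output back to the original ID.
    String idOf(int index) {
        return toId.get(index);
    }

    int size() {
        return toId.size();
    }

    public static void main(String[] args) {
        IdIndex users = new IdIndex();
        IdIndex items = new IdIndex();
        // Made-up sample rows: original user ID, original item ID, rating.
        String[][] prefs = {
            {"u-9001", "sku-77", "2.0"},
            {"u-17", "sku-77", "1.0"},
            {"u-9001", "sku-3", "1.0"}
        };
        // Emit user,item,rating lines with the remapped indexes.
        for (String[] p : prefs) {
            System.out.println(users.indexOf(p[0]) + "," + items.indexOf(p[1]) + "," + p[2]);
        }
        // Translate an item index back to the original ID.
        System.out.println(items.idOf(1));
    }
}
```

For a dataset with millions of users, an in-memory map like this may not fit on one machine; the same idea would then have to be done as a join in whatever tool prepares the input.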

answered Jul 26 '14 at 01:13

pferrel


Hi :) I got your response in the Mahout user group. I'm using org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob and I can't find any requirement to map numerical user and item IDs to the ranges [0..qtty_of_users], [0..qtty_of_items] before feeding preference data to Mahout. The same goes for org.apache.mahout.cf.taste.hadoop.item.RecommenderJob, where I can get the itemSimilarityMatrix. – Capacytron Jul 26 '14 at 10:17
And yet it is true. I'll write up a wiki page describing this. We just rewrote the wiki for the 0.9 release and it was missed. – pferrel Jul 27 '14 at 01:07
Great, going to try it and report back. – Capacytron Jul 27 '14 at 11:30
It didn't help. I used the Apache Pig RANK function to assign an ID in 1..N to each distinct user_id and in 1..M to each distinct item_id. I fed ItemSimilarityJob a dataset where user_id is in [1..N], item_id is in [1..M], and the preference is 1.0 or 2.0. The output is pretty much the same: 16*10^6 preferences from 4*10^6 users for 7*10^5 items give only 10^3 similarities for 10 items... What am I doing wrong? Should the input dataset be ordered by item/user ID? – Capacytron Jul 27 '14 at 20:46

score 0 · Answer 2 · answered Aug 06 '14 at 23:34

It didn't help. I used the Apache Pig RANK function to assign an ID in 1..N to each distinct user_id and in 1..M to each distinct item_id. I fed ItemSimilarityJob a dataset where user_id is in [1..N], item_id is in [1..M], and the preference is 1.0 or 2.0. The output is pretty much the same: 16*10^6 preferences from 4*10^6 users for 7*10^5 items give only 10^3 similarities for 10 items... What am I doing wrong? Should the input dataset be ordered by item/user ID? – Sergey Jul 27 at 20:46

This is because the Mahout Taste implementations accept ints as input for user_ids. If you supply anything that overflows the maximum int value, it will roll over to the minimum, which means it won't get added as a unique user.

You could perhaps hash your user_id before you feed it to Mahout if it exceeds the maximum int value. Or you could keep an alphanumeric ID and use the IDMigrator class for the user_id input.
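The wrap-around described above is ordinary Java narrowing, and it can be seen without Mahout at all. The sketch below is illustrative only: `IdNarrowing` and `hashToInt` are hypothetical names, and this is not a claim about how Mahout converts IDs internally. It shows that a plain cast maps large values onto small ones, and that a hash folded into the non-negative int range avoids negative values but can still collide.

```java
// Illustrative only; these helpers are hypothetical and not part of Mahout.
class IdNarrowing {
    // A narrowing cast keeps only the low 32 bits, so values above
    // Integer.MAX_VALUE wrap around instead of failing.
    static int narrow(long id) {
        return (int) id;
    }

    // Folds a long into a non-negative int. Distinct longs can still collide.
    static int hashToInt(long id) {
        return Long.hashCode(id) & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        long justOverMax = Integer.MAX_VALUE + 1L;
        long wrapsToFive = (1L << 32) + 5L;

        System.out.println(narrow(justOverMax)); // wraps to Integer.MIN_VALUE
        System.out.println(narrow(wrapsToFive)); // same int as the unrelated ID 5
        System.out.println(hashToInt(justOverMax));
        System.out.println(hashToInt(wrapsToFive));
    }
}
```

Because collisions silently merge two users into one, a hash is only a reasonable choice when the ID space is small relative to the int range; otherwise an explicit mapping table is safer.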

score 0 · Answer 3 · edited Oct 15 '15 at 07:53


- I should use SIMILARITY_COSINE instead of SIMILARITY_PEARSON_CORRELATION because I use discrete preferences.
- Don't use threshold; it does not behave the way it does with booleanData=true.
- I'm not sure that I have to "remap" my natural user and item IDs to surrogate IDs in [0...N].

The problem looks solved. Thank you guys!
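One plausible reason the switch helps can be shown with the textbook formulas. The sketch below is an assumption-laden illustration, not Mahout's distributed implementation: `SimilarityDemo` is a hypothetical class and the rating vectors are made up. Pearson correlation is cosine similarity applied to mean-centered vectors, so an item whose ratings are all identical (common when preferences are only 1.0 or 2.0) centers to a zero vector and the correlation is undefined, while plain cosine still gives a value.

```java
// Textbook formulas for illustration; not Mahout's implementation.
class SimilarityDemo {
    // Cosine of the angle between two equal-length rating vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Subtracts the vector's mean from every entry.
    static double[] center(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        double mean = sum / v.length;
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] - mean;
        }
        return out;
    }

    // Pearson correlation as cosine similarity of the mean-centered vectors.
    static double pearson(double[] a, double[] b) {
        return cosine(center(a), center(b));
    }

    public static void main(String[] args) {
        // Made-up ratings of two items by the same three users.
        double[] itemA = {1.0, 1.0, 1.0}; // every rating identical
        double[] itemB = {1.0, 2.0, 1.0};

        System.out.println("cosine  = " + cosine(itemA, itemB));
        System.out.println("pearson = " + pearson(itemA, itemB)); // 0/0, so NaN
    }
}
```

With a threshold of 0.75 on top of this, any pair whose score is undefined or low would presumably be filtered out, which is consistent with the very small number of similarities reported in the question.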

edited Oct 15 '15 at 07:53

lrkwz


answered Aug 07 '14 at 08:14

Capacytron

