Too much data and not enough machines in our Hadoop cluster for Mahout Item-based CF

Question

At work I'm trying to build an item-based recommendation system using Mahout's item-based CF package. Here is the problem we are dealing with:

Number of users: 6,000,000
Number of items: 200,000
Preferences: 10,000,000,000

If we have hundreds of machines in our Hadoop cluster, we might be able to finish the RecommenderJob within several hours. However, the problem is that because we are a small startup, our Hadoop cluster has only about 10 machines at this stage. Ideally, we would like to run the recommendation job once every couple of days.

In order to appreciate the scale of the problem, we have applied Mahout's Item-based CF on a small subset of the data:

Number of users: 100,000
Number of items: 80,000
Preferences: 3,000,000

Time taken for the RecommenderJob is about 10 minutes on our Hadoop cluster.
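For a rough sense of how far the full job is from this test run, the two sets of numbers can be compared directly. The sketch below is only a back-of-the-envelope estimate under two assumed cost models that are not measured anywhere in this thread: work growing linearly with the number of preferences, or work growing with the number of co-rated item pairs per user (with every user treated as having the average number of preferences). It ignores fixed Hadoop job overhead and any sampling Mahout may apply, so the results are orders of magnitude at best.

```python
# Back-of-the-envelope scale comparison between the test run and the full data.
# Both cost models are assumptions, not measurements.

def avg_prefs_per_user(users, prefs):
    """Average number of preferences per user."""
    return prefs / users

def pairwise_work(users, prefs):
    """Approximate number of co-rated item pairs, assuming every user
    has exactly the average number of preferences."""
    n = avg_prefs_per_user(users, prefs)
    return users * n * (n - 1) / 2

small = {"users": 100_000, "prefs": 3_000_000}
full = {"users": 6_000_000, "prefs": 10_000_000_000}

# Assumed model 1: work is proportional to the number of preferences.
linear_factor = full["prefs"] / small["prefs"]

# Assumed model 2: work is proportional to co-rated item pairs per user.
pairwise_factor = pairwise_work(**full) / pairwise_work(**small)

print(f"avg prefs/user (small): {avg_prefs_per_user(**small):.0f}")
print(f"avg prefs/user (full):  {avg_prefs_per_user(**full):.0f}")
print(f"linear growth factor:   {linear_factor:,.0f}x")
print(f"pairwise growth factor: {pairwise_factor:,.0f}x")
```

The full dataset has about 3,300 times more preferences than the sample, but the average user goes from 30 preferences to roughly 1,667, so under the pairwise assumption the gap is far larger than the raw preference count suggests.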

My question is, given our hardware limitation (unlikely to change in the short term), what can we do to speed things up with Mahout's item-based CF?

score 0 · Accepted Answer · answered Jan 07 '14 at 17:26

You seem to have the standard scaling problem of recommendation systems. In your case, you should split your analysis into two parts:

The item-item similarity calculation part.
The user-item recommendation part using the item-item similarity values.

The point is that the similarity between items that have a lot of ratings doesn't change much, and computing it is exactly the costly part. This means you can calculate the similarity for those items once and recompute it only after a long time (weeks, months?). You can evaluate how much the values change after a week, two weeks, etc. Then, every day, you only need to recalculate the item-item similarity for items with fewer ratings, and only if they have new ratings, of course! Having too few ratings is a problem in itself in the recommendation engine area; I won't go into this right now.
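As a minimal sketch of that daily selection step: the function name `items_to_recompute`, the `stable_threshold` value of 1000, and the toy data are all hypothetical and not part of Mahout. The real cutoff would come from measuring how much the similarities actually drift over time.

```python
# Pick the items whose similarities are worth refreshing in the daily run.
# All names and the threshold are illustrative assumptions, not Mahout APIs.

def items_to_recompute(rating_counts, items_with_new_ratings, stable_threshold=1000):
    """Return items that received new ratings and have fewer than
    `stable_threshold` ratings in total. Items at or above the threshold
    are treated as stable and left for the infrequent full recomputation."""
    return {
        item
        for item in items_with_new_ratings
        if rating_counts.get(item, 0) < stable_threshold
    }

# Toy example: total ratings per item, and the items rated since the last run.
rating_counts = {"A": 50_000, "B": 120, "C": 8, "D": 300}
new_today = {"A", "B", "C"}

daily = items_to_recompute(rating_counts, new_today)
print(sorted(daily))
```

Here item A is skipped because it already has many ratings, and item D is skipped because nothing new came in for it.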

So, once you have an always up-to-date item-item similarity list, you can compute the user-item recommendations from it. If the number of items doesn't change much, this is a constant-time operation. It can be done in real time when the user accesses the app, so there is no need to calculate recommendations for a user who never comes back. The predicted rating for a user-item pair is basically the sum of the user's ratings for the items they have rated, each weighted by that item's similarity to the target item. You need to check if Mahout is providing
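The weighted-sum prediction described above can be sketched in a few lines. This is not Mahout code: `predict_rating` and the nested-dictionary layout of the similarity table are assumptions for illustration, and dividing by the sum of absolute similarities is one common normalization choice rather than something stated in this answer.

```python
# Predict a user's rating for one item from a precomputed item-item
# similarity table. Hypothetical helper, not a Mahout API.

def predict_rating(user_ratings, similarities, target_item):
    """Similarity-weighted average of the user's existing ratings.

    user_ratings: {item: rating} for the items this user has rated.
    similarities: {item: {other_item: similarity}} precomputed offline.
    Returns None when no rated item is similar to the target.
    """
    neighbours = similarities.get(target_item, {})
    weighted_sum = 0.0
    weight_total = 0.0
    for item, rating in user_ratings.items():
        sim = neighbours.get(item)
        if sim is None:
            continue  # no known similarity between this item and the target
        weighted_sum += sim * rating
        weight_total += abs(sim)
    if weight_total == 0.0:
        return None
    # Normalizing keeps the prediction on the original rating scale (assumed choice).
    return weighted_sum / weight_total

similarities = {"C": {"A": 0.5, "B": 0.5}}
user_ratings = {"A": 5.0, "B": 3.0}

print(predict_rating(user_ratings, similarities, "C"))
```

Because the loop only touches the items the user has rated, the cost per request depends on that user's history and not on the total number of preferences, which is what makes computing it on demand feasible.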

Thanks so much for your answer. We have come up with a similar solution (pre-computing the similarity table first). Additionally, we have applied subsampling: we are now using about 5% of the total preferences to construct our item similarity. — user2818034, Jan 13 '14 at 07:12
Oh nice... I assume the subsampling hasn't affected the overall quality that much? — fatih, Jan 13 '14 at 09:54
