At work I'm trying to build an item-based recommendation system on top of Mahout's item-based CF package. Here's the problem we are dealing with:
Number of users: 6,000,000; number of items: 200,000; preferences: 10,000,000,000
If we had hundreds of machines in our Hadoop cluster, we might be able to finish the RecommenderJob within several hours. However, we are a small startup, and our Hadoop cluster has only about 10 machines at this stage. Ideally, we would like to run the recommendation job once every couple of days.
To get a feel for the scale of the problem, we ran Mahout's item-based CF on a small subset of the data:
Number of users: 100,000; number of items: 80,000; preferences: 3,000,000
The RecommenderJob takes about 10 minutes on this subset on our Hadoop cluster.
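As a rough sanity check on how far the subset is from the full data, here is a back-of-envelope sketch in Python. It uses only the figures quoted above. The cost model is my own assumption, not something taken from Mahout: it treats every user as having the average number of preferences and counts the item pairs each user would contribute to a co-occurrence computation. The function names are made up for illustration, and the real run time will depend on how the job is configured, which this ignores.

```python
# Back-of-envelope scaling estimate using only the figures from the question.
full = {"users": 6_000_000, "items": 200_000, "prefs": 10_000_000_000}
sample = {"users": 100_000, "items": 80_000, "prefs": 3_000_000}
sample_minutes = 10  # measured RecommenderJob time on the subset


def prefs_per_user(data):
    """Average number of preferences per user."""
    return data["prefs"] / data["users"]


def cooccurrence_pairs(data):
    """Item pairs emitted if every user had exactly the average number of
    preferences n (an assumption): users * n * (n - 1) / 2."""
    n = prefs_per_user(data)
    return data["users"] * n * (n - 1) / 2


# Optimistic case: cost grows linearly with the number of preferences.
linear_ratio = full["prefs"] / sample["prefs"]

# Pessimistic case: cost grows with the number of co-occurring item pairs.
pair_ratio = cooccurrence_pairs(full) / cooccurrence_pairs(sample)

if __name__ == "__main__":
    print(f"prefs per user: subset {prefs_per_user(sample):.0f}, "
          f"full {prefs_per_user(full):.0f}")
    print(f"linear scale-up: {linear_ratio:,.0f}x "
          f"(~{sample_minutes * linear_ratio / 60 / 24:.0f} days)")
    print(f"pairwise scale-up: {pair_ratio:,.0f}x")
```

The point of the two ratios is that the subset is much sparser per user (about 30 preferences per user versus roughly 1,667 in the full data), so the 10-minute figure probably understates the full job even after scaling by the number of preferences.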
My question is: given our hardware limitation (which is unlikely to change in the short term), what can we do to speed things up with Mahout's item-based CF?