Mahout Optimization : Multithreading TopItems.getTopUsers() and TopItems.getTopItems()

Question

We have the following system in place:
Number of users: ~500k
Number of items: ~100k

UserSimilarity userSimilarity = new TanimotoCoefficientSimilarity(dataModel);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(neighborHoodSize, userSimilarity, dataModel);
GenericBooleanPrefUserBasedRecommender recommender = new GenericBooleanPrefUserBasedRecommender(dataModel, neighborhood, userSimilarity);

With the above recommender, the average response time was about 600 ms with a neighbourhood size of 400.

Since this is an online engine, we wanted to get below 100 ms. We achieved this with custom multithreaded versions of TopItems.getTopUsers() and TopItems.getTopItems(), using as many threads as there are cores. Average time taken by each function:
TopUsers(): ~30-40 ms
TopItems(): ~50-60 ms

However, when we made many concurrent requests (even on the order of 25), the response time shot up into the range of seconds.

We could afford to precompute something like the neighbourhood for each user, but TopItems() is still a clear bottleneck for concurrent requests.

Can you suggest any way to improve the response time for concurrent requests when using multithreading?

The fallback option would be to store precomputed recommendations in a NoSQL database. This would be somewhat expensive, since we would precompute on a regular basis even for users who are not very active. We could probably pick the active users and precompute their recommendations more often than those of less active users.

Any thoughts?

score 1 · Answer 1 · answered Jul 11 '13 at 16:16

Yes: multi-threading does not increase the overall throughput of a system. It lets you answer one request faster by bringing more threads to bear on it. But once the number of concurrent requests equals your number of cores, you are more or less back where you started; in fact, the overhead of threading may make it slower.
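
A back-of-the-envelope estimate shows why 25 concurrent requests can push latency into seconds. The sketch below is only an illustration under stated assumptions: the work is CPU-bound, it parallelizes perfectly across cores, in-flight requests share the cores fairly, and the machine has 8 cores (the question does not give the core count). The class and method names are made up for this example.

```java
public class LatencyEstimate {

    /**
     * Idealized per-request latency when CPU-bound requests share one machine.
     *
     * @param cpuMsPerRequest    total CPU work of one request, summed over all its threads
     * @param cores              number of cores on the machine
     * @param concurrentRequests number of requests in flight at the same time
     */
    public static double estimateLatencyMs(double cpuMsPerRequest, int cores, int concurrentRequests) {
        // All in-flight requests together need this much CPU time.
        double totalWorkMs = cpuMsPerRequest * concurrentRequests;
        // Perfect parallelism spreads that work evenly over the cores.
        // Real systems do worse because of scheduling and synchronization overhead.
        return totalWorkMs / cores;
    }

    public static void main(String[] args) {
        double cpuMs = 600.0; // single-threaded response time reported in the question
        int cores = 8;        // ASSUMPTION: core count is not stated in the question

        System.out.println(estimateLatencyMs(cpuMs, cores, 1));  // prints 75.0
        System.out.println(estimateLatencyMs(cpuMs, cores, 25)); // prints 1875.0
    }
}
```

Under these assumptions a single request finishes in about 75 ms, which is in line with the sub-100 ms figure reported, while 25 simultaneous requests each take close to 2 seconds. The total CPU work per request has not changed; the threads only decide how it is divided.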

Of course you can always try adding more machines and maintaining N instances of this service.

This is probably about as well as you're going to do with a neighborhood-based model. The item-neighborhood versions have a few more levers to pull: you can control how many items are considered by sampling them, which can help.
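
To make the sampling idea concrete, here is a minimal sketch of capping the number of candidate items with reservoir sampling. This is not Mahout's implementation or API; `CandidateSampler` and its parameters are hypothetical names used only to show the general technique of trading a little accuracy for a bounded amount of work per request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CandidateSampler {

    /**
     * Returns at most maxCandidates item IDs, chosen uniformly at random
     * from candidateItemIds using reservoir sampling (a single pass over the input).
     */
    public static List<Long> sample(List<Long> candidateItemIds, int maxCandidates, Random random) {
        if (maxCandidates <= 0) {
            return new ArrayList<>();
        }
        List<Long> reservoir = new ArrayList<>(Math.min(maxCandidates, candidateItemIds.size()));
        int seen = 0;
        for (Long itemId : candidateItemIds) {
            seen++;
            if (reservoir.size() < maxCandidates) {
                // Fill the reservoir with the first maxCandidates items.
                reservoir.add(itemId);
            } else {
                // Keep the new item with probability maxCandidates / seen.
                int slot = random.nextInt(seen);
                if (slot < maxCandidates) {
                    reservoir.set(slot, itemId);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Long> candidates = new ArrayList<>();
        for (long id = 0; id < 100_000; id++) {
            candidates.add(id);
        }
        // Score only 1,000 of the 100,000 candidates (the cap is an arbitrary example value).
        List<Long> sampled = sample(candidates, 1_000, new Random());
        System.out.println(sampled.size()); // prints 1000
    }
}
```

The scoring step then runs over the sampled list only, so its cost no longer grows with the full item count.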

Beyond that you probably need to look at models built to scale better. I personally favor matrix factorization-based techniques as better in this way.
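
As a rough illustration of why factorization models tend to be cheaper to serve: once user and item factor vectors have been learned offline, a recommendation is a dot product per item plus a top-N selection, with no neighborhood search at request time. The sketch below assumes the factors already exist as plain arrays; `FactorScorer` and its methods are hypothetical names, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FactorScorer {

    /** Dot product of two factor vectors of equal length. */
    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns the indices of the topN highest-scoring items, best first. */
    public static List<Integer> topItems(double[] userFactors, double[][] itemFactors, int topN) {
        List<Integer> result = new ArrayList<>();
        if (topN <= 0) {
            return result;
        }
        // Min-heap ordered by score; each entry is {score, itemIndex}.
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
        for (int item = 0; item < itemFactors.length; item++) {
            double score = dot(userFactors, itemFactors[item]);
            if (heap.size() < topN) {
                heap.add(new double[] {score, item});
            } else if (score > heap.peek()[0]) {
                // Replace the weakest of the current top N.
                heap.poll();
                heap.add(new double[] {score, item});
            }
        }
        // The heap yields the lowest score first, so insert at the front to get best first.
        while (!heap.isEmpty()) {
            result.add(0, (int) heap.poll()[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.0};
        double[][] items = {{0.5, 9.0}, {2.0, 0.0}, {1.0, 1.0}, {-1.0, 0.0}};
        System.out.println(topItems(user, items, 2)); // prints [1, 2]
    }
}
```

With k factors per vector, each request costs on the order of (number of items × k) multiplications, which is predictable and easy to spread across machines.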

Thanks Sean for the matrix factorization mention. Looks like it is more powerful in terms of personalization. — user2572019, Jul 12 '13 at 10:57
