
Background:
I'm a software engineering student and I was checking out several algorithms for recommendation systems. One of these, collaborative filtering, has a lot of loops in it: it has to go through all of the users and, for each user, all of the ratings they have made on movies or other rateable items. I was thinking of implementing it in Ruby for a Rails app. A rough sketch of the kind of nested loops I mean (user-based filtering with cosine similarity; the shape of the `ratings` hash is just an assumption):
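
    # Sketch only: assumes ratings = { user_id => { item_id => score } }.
    # The nested loops over users and their ratings are what make this
    # expensive for large datasets.
    def similarity(a, b)
      shared = a.keys & b.keys
      return 0.0 if shared.empty?
      dot   = shared.sum { |item| a[item] * b[item] }
      mag_a = Math.sqrt(a.values.sum { |r| r**2 })
      mag_b = Math.sqrt(b.values.sum { |r| r**2 })
      dot / (mag_a * mag_b)
    end

    def recommendations_for(user_id, ratings)
      scores = Hash.new(0.0)
      ratings.each do |other_id, other_ratings|
        next if other_id == user_id
        sim = similarity(ratings[user_id], other_ratings)
        other_ratings.each do |item, score|
          next if ratings[user_id].key?(item) # skip items already rated
          scores[item] += sim * score
        end
      end
      scores.sort_by { |_, score| -score }
    end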

The point is there is a lot of data to be processed so:

  1. Should this be done in the database, using regular queries, or using PL/SQL or something similar? (Testing databases is extremely time consuming and hard, especially for this kind of algorithm.)

  2. Should I do a background job that caches the results of the algorithm? (If so, the data is processed in memory, and if there are millions of users, how well does that scale?)

  3. Should I run the algorithm on every request, or every x requests? (Again, the data is processed in memory.)

The Question:
I know there are tools that do this, like Apache Mahout, but they rely on Hadoop for scaling. Is there another way out? Is there a Mahout or machine-learning equivalent for Ruby, and if so, where does the computation take place?

fernandohur
  • This doesn't directly answer your question, but might help: http://stackoverflow.com/questions/2084131/mahout-plugin-for-ruby-on-rails – Josh Jun 18 '13 at 07:10

1 Answer


Here are my thoughts on each of the methods:

  1. No, it should not. Some calculations would be much faster to run in your database and some would not. However, it would be hard and time consuming to test exactly which calculations should be run in your DB, and you would probably find that some part of the algorithm is slow in PostgreSQL or whatever you use. More importantly, the database is not the right place to run this logic: as you say yourself, it would be hard to test, and it's bad practice overall. It would also hurt the performance of your requests every time the DB has to run the algorithm. On top of that, the DB would still use a lot of memory processing it, so that isn't an advantage either.

  2. By far the best solution. See below for more explanation.

  3. This is a much better solution than number one. However, it would mean that your app's performance would be very unstable: sometimes all resources would be free for normal requests, and sometimes you would use them all on your calculations.

Option 2 is the best solution, as it doesn't interfere with the performance of the rest of your app and is much easier to scale, since it works in isolation. If, for example, you find that your worker can't keep up, you can just add more running processes.

More importantly, you would be able to run the background processes on a separate server, and thereby easily monitor memory and resource usage and scale the server as necessary.

Even for real-time updates, a background job is the best solution (if, of course, the calculation isn't small enough to be done in the request). You could create a "high priority" queue that has enough resources to almost always be empty. If you need to show the result to the user without a reload, you would have to add some kind of push notification after a background job completes. That notification could then trigger an update on the page through JavaScript (you can also check out the new live streaming feature in Rails 4). As a sketch, assuming Sidekiq (the queue name and RecommendationEngine are illustrative, not a real API), a dedicated queue could look like this:
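
    require 'sidekiq'

    class RealtimeRecommendationWorker
      include Sidekiq::Worker
      sidekiq_options queue: :critical # dedicated high-priority queue

      def perform(user_id)
        result = RecommendationEngine.compute(user_id) # hypothetical service
        # push `result` to the client here (e.g. your notification layer)
      end
    end

    # Start workers so the critical queue is drained first:
    #   bundle exec sidekiq -q critical,5 -q default,1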

I would recommend something like Sidekiq with Redis. You could then cache the results in Memcached, or you could recalculate the result each time; that really depends on how often you need it. With this solution, though, it would be much easier to set up a stable cache if you want one. A minimal sketch of such a worker, assuming a hypothetical RecommendationEngine and the standard Rails cache API:
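
    require 'sidekiq'

    # Compute in the background, cache with Rails.cache (backed by
    # Memcached or Redis). RecommendationEngine is hypothetical.
    class RecommendationWorker
      include Sidekiq::Worker

      def perform(user_id)
        scores = RecommendationEngine.compute(user_id) # the heavy part
        Rails.cache.write("recommendations/#{user_id}", scores, expires_in: 12.hours)
      end
    end

    # In a controller, a request then only reads precomputed data:
    #   @recommendations = Rails.cache.read("recommendations/#{current_user.id}")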

Where I work, we have an application that runs some heavy queries with a lot of calculations like this. Each night these jobs are queued and then run on an isolated server over the next few hours. This scales really well and is also easy to monitor with New Relic. As an illustration (not our actual code), a nightly schedule with the whenever gem could look like this:
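
    # config/schedule.rb -- a sketch using the `whenever` gem (an assumption;
    # any cron-like scheduler works). It enqueues the heavy jobs at night so
    # they run on the isolated worker server during off-peak hours.
    every 1.day, at: '2:00 am' do
      runner "User.find_each { |user| RecommendationWorker.perform_async(user.id) }"
    end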

Hope this helps and makes sense (I know my English isn't perfect), but please feel free to ask if I've misunderstood something or you have more questions.

jokklan
  • Great answer, thanks. But what happens if the computations are expected to be "fast", i.e. a real-time dashboard that has to operate on all the individual purchases customers are making worldwide? The data can't be cached or processed in a background thread, so how could it be done? – fernandohur Jun 19 '13 at 20:56
  • Do all these calculations need to be fast, or are some more important than others? Also, is it a few large calculations or a lot of calculations all the time? – jokklan Jun 19 '13 at 21:18
  • I don't really have a specific problem in mind, but basically: if calculations aren't needed in real time, you can do them in a background job and cache or persist the results. But when results are needed fast, i.e. stock markets or the like, what can you do? Maybe this is unrelated, but StumbleUpon is (I think) a similar scenario: they compute your next stumble based on your and your friends' likes and recommend more precise content with every stumble. Assuming they indeed recalculate your next stumble every single time you 'like' something, would this still be done in the background? – fernandohur Jun 20 '13 at 13:29
  • Short answer: yes! Long answer: a background job can be very fast if you implement it as such. E.g. the Sidekiq GitHub site warns that jobs can be processed so fast that, if triggered by an after_create, the database write might not even be finished before the job runs, which could result in errors (see the sketch after these comments). I will update my answer to explain this more. – jokklan Jun 20 '13 at 13:32
  • Thanks for the help @jokklan. I accepted your answer already, but it would be great if you could improve the answer for SO – fernandohur Jun 20 '13 at 13:38
  • Thanks! Is this fine, or do you have some suggestions for improvement :)? – jokklan Jun 20 '13 at 13:46
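
For reference, a minimal sketch of the after_create gotcha from the comments, and the usual fix with after_commit (model and attribute names are hypothetical):

    # Sidekiq can pick up the job before the creating transaction commits,
    # so the worker may not find the record yet. Enqueueing from
    # after_commit avoids this race.
    class Rating < ActiveRecord::Base
      # after_create :enqueue_recalculation   # risky: job may run pre-commit
      after_commit :enqueue_recalculation, on: :create

      private

      def enqueue_recalculation
        RecommendationWorker.perform_async(user_id)
      end
    end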