
I have a Ruby daemon that selects 100 records from the database and performs a task on each of them.

To make it faster, I usually run 3 instances of the same daemon, and each one selects different data by using MySQL LIMIT and OFFSET.

The problem is that sometimes a task is performed 2 or 3 times on the same record.

So I think that relying only on the database's LIMIT and OFFSET is not enough, since 2 or more daemons can sometimes collect the same data at the same time.

How can I do this safely, so that no 2 instances select the same data?

  • Daemon 1 => selects records from 1 to 100
  • Daemon 2 => selects records from 101 to 200
  • Daemon 3 => selects records from 201 to 300
Mitch Dempsey
newx

1 Answer


Rather than rolling your own solution, you might want to look at existing solutions for processing background jobs, such as Resque (a personal favorite). With Resque, you would queue a job for each of your rows, using a trigger that makes sense in your application (it's hard to say which without any context), for example a link on your website. At all times you would keep X workers running (three in your case), and Resque would do the queue management for you. Resque uses Redis as a backend, so it supports atomic push/pop out of the box (no more double-processing).

Resque also comes with an intuitive, easy-to-use web interface for monitoring your jobs and workers.
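
To illustrate why an atomic pop prevents double-processing, here is a minimal sketch that needs neither Resque nor Redis. It is not Resque's API: it uses Ruby's built-in thread-safe `Queue` as a stand-in for the Redis-backed queue, and three threads as stand-ins for the three workers. The record ids and the `process_with_workers` helper are hypothetical names made up for this example.

```ruby
# Hypothetical stand-in for the ids of the rows the daemons would process.
RECORD_IDS = (1..300).to_a

# Push every id onto one shared queue, then let each worker pop ids
# until the queue is drained. Returns a hash of worker index => ids handled.
def process_with_workers(record_ids, worker_count)
  queue = Queue.new
  record_ids.each { |id| queue << id }
  queue.close # once closed and empty, pop returns nil instead of blocking

  threads = Array.new(worker_count) do
    Thread.new do
      handled = []
      # Queue#pop is thread-safe: each id is handed to exactly one worker.
      while (id = queue.pop)
        handled << id # the real task (e.g. sending an email) would go here
      end
      handled
    end
  end

  # Thread#value waits for the thread and returns its block's result.
  threads.each_with_index.map { |thread, index| [index, thread.value] }.to_h
end

results = process_with_workers(RECORD_IDS, 3)
all_handled = results.values.flatten
puts "handled #{all_handled.size} records, #{all_handled.uniq.size} unique"
```

The workers never compute their own LIMIT and OFFSET ranges. Each one just takes the next available id, so no coordination between workers is needed and no id can be taken twice.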

bloudermilk
  • Currently I am using Ruby's Daemons gem (github.com/mikehale/daemons). What I do is check how many instances are running and pass LIMIT and OFFSET as params to each daemon instance. So if I have 1000 emails to send, each daemon instance selects a different 100 database rows. But sometimes a row or email gets delivered 2 or 3 times, meaning that 2 daemon instances are selecting the same database rows. – newx Apr 27 '11 at 13:00
  • I understand what your problem is. What I'm trying to say is that you shouldn't be wasting your time dealing with this logic. People have already invested hundreds of hours of their own time to create libraries that work great for processing background jobs. You should be focusing on the jobs themselves. That being said, I'm sure you've just encountered a bug in your SQL. – bloudermilk Apr 27 '11 at 18:35
  • Thanks for your answer. I'm reviewing Resque now to learn how it can improve my application. Thanks a lot. – newx May 05 '11 at 19:46
  • Bloudermilk, I don't think using Resque would be the best option for me, since I need to control the flow of email delivery by selecting some rows but not others. If I use Resque, I would not have a way to SELECT only the rows I want, like I do in MySQL. Right? – newx Nov 04 '11 at 16:51
  • @NewtonX it's impossible to say without knowing more about your application. Evaluate Resque, DelayedJob and other solutions to find out what is best for your application. Post new questions if necessary, no need to bump this thread. – bloudermilk Nov 05 '11 at 07:45
  • Thanks for your help. After reading about Resque I learned about the Redis key-value store. I'm using it to distribute tasks between many servers, since Redis takes care of atomic updates. – newx May 29 '12 at 12:39