
I'm using Delayed Job in a setup where I run multiple workers. The exact number doesn't really matter for my question, but let's say I run 10 workers (which is what I'm currently doing in development mode).

The problem I am having is that two different workers sometimes start working on the same job, calling the perform method on my job object.

To the best of my understanding, Delayed Job uses pessimistic locking to prevent this from happening, but it seems a second worker sometimes still has enough time to steal the job before the first worker has actually locked it.

I'm just asking to see if anyone else has experienced this problem, or if it is my setup that is misbehaving. I'm using Postgres, and this happens both on my dev machine and on Heroku, where I host the app.

I will try to work around it within my jobs (see the sketch below), but it is still problematic that this happens at all. Ideally, Delayed Job should never run the same job from two processes.
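
For what it's worth, here is the kind of in-job workaround I mean: a minimal sketch, assuming a hypothetical Order model with a processed_at column and a process! method (none of which is from my actual app), that takes a row lock and re-checks state so a duplicate run becomes a no-op:

# Hypothetical job; Order, processed_at and process! are assumptions
# for illustration only.
class ProcessOrderJob < Struct.new(:order_id)
  def perform
    order = Order.find(order_id)
    # with_lock reloads the row under SELECT ... FOR UPDATE inside a
    # transaction, so a second worker blocks here until the first commits.
    order.with_lock do
      if order.processed_at.nil?
        order.process!
        order.update_column(:processed_at, Time.now)
      end
    end
  end
end

Even with a guard like that, the duplicate pickup still wastes a worker, which is why I'd rather it never happened in the first place.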

Thanks!

Kenny Lövrin
  • I see something similar. Haven't been able to completely track it down, but it seems that between checking for the lock and taking the lock, multiple workers are grabbing and executing the job. – Rob Di Marco Mar 25 '13 at 18:23
  • I should say that I found setting `Delayed::Worker.read_ahead = 1` in an initializer seemed to mitigate the problem (see the initializer sketch after these comments). – Rob Di Marco Mar 25 '13 at 18:52
  • Had the same issue with Resque, didn't find a solution – Andrey Kryachkov Mar 25 '13 at 18:57
  • Thanks guys, it's really frustrating, even trying to lock my objects within the jobs sometimes doesn't work. I'll try to read ahead and see if it helps. – Kenny Lövrin Mar 26 '13 at 09:05
  • Setting the read ahead didn't help for me really, still the same problem. It's a shame, because it really means you cannot reliably have more than one worker on a queue, as I understand it. – Kenny Lövrin Mar 26 '13 at 09:23
  • Have you made any progress on this? I'm seeing similar things; in my case, it's causing email to be sent out 4 times instead of 1. Is there any chance you're falling afoul of the "jobs which take more than 4 hours can be picked up by another worker" rule? – Michael H. Jul 13 '13 at 02:59
  • No, we actually just gave up and swapped everything for a Sidekiq solution instead. I must say that while it sounds like more work to set up Redis and Sidekiq, it is very well worth it. We've had both much better performance and more or less "perfect" stability since we changed. Our Sidekiq workers have processed almost 50 million jobs since the switch without any problems that we didn't cause ourselves. :) – Kenny Lövrin Jul 15 '13 at 08:11
  • @KennyLövrin For real, eh? I'm maintaining a system that relies heavily on Delayed Job and it's as clogged and slow as hell... and we haven't even gone live yet! I'm gonna check out this Sidekiq thing. – abbood Feb 21 '14 at 11:45
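
For anyone who wants to try Rob Di Marco's read_ahead suggestion from the comments above, a minimal sketch of the initializer (the file name is just a convention; Delayed::Worker.read_ahead is a real setting):

# config/initializers/delayed_job.rb
# Consider only one candidate job per reservation attempt, instead of
# reading several rows ahead that another worker may lock first.
Delayed::Worker.read_ahead = 1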

1 Answer


We've run about 60 million jobs through Delayed Job with 12 workers and never had a report of this. What's the SQL that your Delayed Job worker is running? Are you using a gem that changes the locking behavior of Postgres?
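
If you're not sure, a quick way to see it (a sketch using plain ActiveRecord logging, nothing Delayed-Job-specific) is to point the logger at stdout in the process that reserves jobs:

require "logger"
# Echo every SQL statement, including Delayed Job's reservation query.
ActiveRecord::Base.logger = Logger.new(STDOUT)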

Here is what the DJ SQL looks like for me (note the FOR UPDATE in the subquery, which is what takes the row-level lock):

UPDATE "delayed_jobs" SET locked_at = '2014-05-02 21:16:35.419748', locked_by =
'host:whatever.local pid:4729' WHERE id IN (SELECT id FROM "delayed_jobs" 
WHERE ((run_at <= '2014-05-02 21:16:35.415923' 
AND (locked_at IS NULL OR locked_at < '2014-05-02 17:16:35.415947') 
OR locked_by = 'host:whatever.local pid:4729') AND failed_at IS NULL) 
ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING *

Do you have locking problems with any other code? Could you try running two Rails console sessions and doing this:

Console Session 1:

User.find(1).with_lock do
  sleep(10)
  puts "worker 1 done"
end

Console Session 2:

User.find(1).with_lock do
  sleep(1)
  puts "worker 2 done"
end

Start both of those at the same time; if session 2 finishes before session 1, you've got a locking problem more general than Delayed Job.

John Naegle