
I'm running Sidekiq inside a Docker container in production and don't have access to the web UI. Some Sidekiq jobs appear to have failed, and I need to confirm whether they really have failed and then delete or retry them.

Not a hundred percent sure what I'm seeing here, but having collected the workers with `workers = Sidekiq::Workers.new`, I'm getting this result in the Rails console, which leads me to believe I have some dead jobs:

workers.each { |process_id, thread_id, work| puts "Worker #{work}\n\n" }

Worker {"queue"=>"default", "payload"=>{"retry"=>1, "queue"=>"default", "class"=>"PeopleWorker", "args"=>["<arg-1>", "55800c0161616600b5000000"], "jid"=>"08126d4162242a26825ce2d3", "enqueued_at"=>1436800316.1181111, "error_message"=>"Error 503: The query timed out", "failed_at"=>1436816149.1032495, "retry_count"=>0}, "run_at"=>1436870942}

Worker {"queue"=>"default", "payload"=>{"retry"=>1, "queue"=>"default", "class"=>"PeopleWorker", "args"=>["<arg-1>", "55800c0161616600b5000000"], "jid"=>"16a68d843116702daad847d6", "enqueued_at"=>1436800316.2001767, "error_message"=>"Error 503: The query timed out", "failed_at"=>1436816221.2766316, "retry_count"=>0}, "run_at"=>1436874457}

Worker {"queue"=>"default", "payload"=>{"retry"=>1, "queue"=>"default", "class"=>"PeopleWorker", "args"=>["<arg-1>", "55800c0161616600b5000000"], "jid"=>"999ed8c1bb43192fa9a5c8b1", "enqueued_at"=>1436800312.3595853, "error_message"=>"Error 503: The query timed out", "failed_at"=>1436816142.493408, "retry_count"=>0}, "run_at"=>1436868587}

Worker {"queue"=>"default", "payload"=>{"retry"=>1, "queue"=>"default", "class"=>"PeopleWorker", "args"=>["<arg-1>", "55800c0161616600b5000000"], "jid"=>"91d2ece3dd75dd8a4c95baed", "enqueued_at"=>1436800316.4514835, "error_message"=>"Error 503: The query timed out", "failed_at"=>1436817504.064808, "retry_count"=>0}, "run_at"=>1436875742}

Worker {"queue"=>"default", "payload"=>{"retry"=>1, "queue"=>"default", "class"=>"PeopleWorker", "args"=>["<arg-1>", "55800c0161616600b5000000"], "jid"=>"af620ff8406c126f8f2df89c", "enqueued_at"=>1436800315.562301, "error_message"=>"Error 503: The query timed out", "failed_at"=>1436816221.7349763, "retry_count"=>0}, "run_at"=>1436872039}

Worker {"queue"=>"default", "payload"=>{"retry"=>1, "queue"=>"default", "class"=>"PeopleWorker", "args"=>["<arg-1>", "55800c0161616600b5000000"], "jid"=>"79601ece1f09a7721881bb0b", "enqueued_at"=>1436800316.3225756, "error_message"=>"Error 500: GC overhead limit exceeded", "error_class"=>"Tripod::Errors::BadSparqlRequest", "failed_at"=>1436817517.111997, "retry_count"=>0}, "run_at"=>1436876319}

=> ["1cc9c3e7af3e:104", "1cc9c3e7af3e:117", "1cc9c3e7af3e:130", "1cc9c3e7af3e:150", "1cc9c3e7af3e:164", "1cc9c3e7af3e:191", "1cc9c3e7af3e:210", "1cc9c3e7af3e:224", "1cc9c3e7af3e:250", "1cc9c3e7af3e:263", "1cc9c3e7af3e:290", "1cc9c3e7af3e:311", "1cc9c3e7af3e:323", "1cc9c3e7af3e:350", "1cc9c3e7af3e:91"]

According to htop there are 15 Sidekiq processes currently running, so I'm curious as to exactly what's happening here with these results.

  1. Am I correct in my understanding that, having hit an exception during execution, these jobs are in the dead queue?
  2. That being the case, should I force a retry of these jobs, or should they be deleted? I have no reason to think they will fail a second time.
Alex Lynham

2 Answers


Please read through the Sidekiq API, including Sidekiq::RetrySet and Sidekiq::DeadSet.

https://github.com/mperham/sidekiq/wiki/API#retries

Jobs hitting an exception go into the RetrySet so they can be retried automatically.
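For example, something along these lines from a Rails console lets you inspect both sets and retry or delete entries. This is a rough sketch using the `sidekiq/api` classes mentioned above; adjust for your Sidekiq version:

require 'sidekiq/api'

retries = Sidekiq::RetrySet.new
dead    = Sidekiq::DeadSet.new

puts "Jobs awaiting retry: #{retries.size}"
puts "Dead jobs:           #{dead.size}"

# Each entry wraps the job hash shown in the question: class, args,
# last error, and the time of the next scheduled attempt.
retries.each do |job|
  puts "#{job.klass} #{job.jid} failed with #{job.item['error_message']}, retrying at #{job.at}"
end

# Act on a single entry...
# job.retry    # re-enqueue it immediately
# job.delete   # discard it

# ...or on the whole set:
# retries.retry_all
# retries.clear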

Mike Perham
  • So to be clear - you're saying these are stuck because I haven't included `Sidekiq::RetrySet` in the worker, so the jobs can't be retried? Or is it just that these jobs _have at some point_ hit an exception, and so that's stored even though they are now in the process of retrying? – Alex Lynham Jul 14 '15 at 14:06
  • The retry system is built into Sidekiq; you needn't do anything - it just works. https://github.com/mperham/sidekiq/wiki/Error-Handling – Mike Perham Jul 14 '15 at 15:56
  • To be clear, RetrySet and all of the other classes in `sidekiq/api` are what the Web UI uses to perform all its operations. If you can do it manually in the Web UI, you can do it programmatically with the API (a console sketch follows these comments). – Mike Perham Jul 14 '15 at 16:17
  • Okay, thanks for that. I think my issue is that these were long-running jobs and they were falling off the radar after the thirty minute default. They were completing, they just weren't appearing in the queue or `sidekiq-status` methods. – Alex Lynham Jul 15 '15 at 08:48
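For completeness, the busy-worker output in the question can be cross-checked from the console. The sketch below assumes the payload format shown in the question (Sidekiq 3.x; newer versions may store the payload as a JSON string). A job that carries error_message/retry_count while it appears in Sidekiq::Workers is executing a retry attempt, not sitting in the dead set:

require 'sidekiq/api'
require 'json'

# Jobs currently being executed by a worker thread.
Sidekiq::Workers.new.each do |_process_id, _thread_id, work|
  payload = work['payload']
  payload = JSON.parse(payload) if payload.is_a?(String) # newer Sidekiq stores a JSON string
  next unless payload['error_message'] # only jobs that have already failed at least once
  puts "#{payload['class']} #{payload['jid']} is running retry ##{payload['retry_count'] + 1} after: #{payload['error_message']}"
end

# Jobs still waiting for their next scheduled attempt show up here instead:
Sidekiq::RetrySet.new.each { |job| puts "waiting to retry: #{job.klass} #{job.jid} at #{job.at}" }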

You can clear the "Dead" jobs set with this command:

Sidekiq::DeadSet.new.clear
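
Before clearing, it may be worth listing what is in the dead set. A brief sketch using the same `sidekiq/api` classes:

require 'sidekiq/api'

dead = Sidekiq::DeadSet.new
puts "Dead jobs: #{dead.size}"

# Review the entries before discarding them.
dead.each do |job|
  puts "#{job.klass} #{job.jid}: #{job.item['error_message']}"
end

# Re-enqueue everything instead of clearing (entries are removed from the set as they are retried):
# dead.each(&:retry)

# Or drop the set entirely:
# dead.clear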