
When we restart or deploy we get a number of Resque jobs in the failed queue with either Resque::TermException (SIGTERM) or Resque::DirtyExit.

We're using the new TERM_CHILD=1 and RESQUE_TERM_TIMEOUT=10 settings in our Procfile, so our worker line looks like:

worker:  TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 bundle exec rake environment resque:work QUEUE=critical,high,low

We're also using resque-retry, which I thought might auto-retry on these two exceptions, but it doesn't seem to.

So I guess two questions:

  1. We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do this for all jobs? Even a monkey patch.
  2. Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?

Thanks!

Edit: Getting all jobs to complete in less than 10 seconds seems unreasonable at scale. It seems like there needs to be a way to automatically re-queue these jobs when the Resque::DirtyExit exception is raised.

Brian Armstrong

5 Answers


I ran into this issue as well. It turns out that Heroku sends the SIGTERM signal not just to the parent process but to all forked processes. This is not the logic that Resque expects, which causes the RESQUE_PRE_SHUTDOWN_TIMEOUT to be skipped, forcing jobs to be terminated without any time to attempt to finish.

Heroku gives workers 30s to gracefully shutdown after a SIGTERM is issued. In most cases, this is plenty of time to finish a job with some buffer time left over to requeue the job to Resque if the job couldn't finish. However, for all of this time to be used you need to set the RESQUE_PRE_SHUTDOWN_TIMEOUT and RESQUE_TERM_TIMEOUT env vars as well as patch Resque to correctly respond to SIGTERM being sent to forked processes.

Here's a gem which patches resque and explains this issue in more detail:

https://github.com/iloveitaly/resque-heroku-signals
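
For reference, the kind of setup described above looks roughly like this (the timeout values are illustrative, not prescriptive; the pre-shutdown and term timeouts together need to fit inside Heroku's ~30 second grace period):

# Gemfile
gem "resque-heroku-signals"

# Procfile (illustrative timeout values)
worker: TERM_CHILD=1 RESQUE_PRE_SHUTDOWN_TIMEOUT=20 RESQUE_TERM_TIMEOUT=8 bundle exec rake environment resque:work QUEUE=critical,high,low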

iloveitaly
  • This is the correct explanation. Thanks @iloveitaly – Yoni Jan 17 '18 at 15:37
  • I'm curious if anyone understands the current situation with resque 2.0.0. I don't trust that old patching gem with recent resque, but am also not sure if resque is behaving properly. – jrochkind Jun 07 '21 at 20:52
  • The resque-heroku-signals gem works with resque 2.0. I'm using it in production. – iloveitaly Jun 09 '21 at 00:52

This will be a two part answer, first addressing Resque::TermException and then Resque::DirtyExit.

TermException

It's worth noting that if you are using ActiveJob with Rails 7 or later, the retry_on and discard_on methods can be used to handle Resque::TermException. You could write the following in your job class:

retry_on(::Resque::TermException, wait: 2.minutes, attempts: 4)

or

discard_on(::Resque::TermException)

A big caveat here is that if you are using a Rails version prior to 7 you'll need to add some custom code to get this to work.

The reason is that Resque::TermException does not inherit from StandardError (it inherits from SignalException, source: https://github.com/resque/resque/blob/master/lib/resque/errors.rb#L26) and prior to Rails 7 retry_on and discard_on only handle exceptions that inherit from StandardError.

Here's the Rails 7 commit that changes this to work with all exception subclasses: https://github.com/rails/rails/commit/142ae54e54ac81a0f62eaa43c3c280307cf2127a

So if you want to use retry_on to handle Resque::TermException on a Rails version earlier than 7 you have a few options:

  1. Monkey patch TermException so that it inherits from StandardError.
  2. Add a rescue statement to your perform method that explicitly looks for Resque::TermException or one of its ancestors (eg SignalException, Exception).
  3. Patch the implementation of perform_now with the Rails 7 version (this is what I did in my codebase).

Here's how you can retry on a TermException by adding a rescue to your job's perform method:

class MyJob < ActiveJob::Base
  # ActiveJob's `retry_on` and `discard_on` methods don't handle `TermException`
  # because it inherits from `SignalException` rather than `StandardError`.
  module RetryOnTermination
    def perform(*args, **kwargs)
      super
    rescue Resque::TermException
      Rails.logger.info("Retrying #{self.class.name} due to Resque::TermException")
      self.class.set(wait: 2.minutes).perform_later(*args, **kwargs)
    end
  end

  # Prepend after the module is defined so the constant exists when it's referenced.
  prepend RetryOnTermination
end

Alternatively you can use the Rails 7 definition of perform_now by adding this to your job class:

  # FIXME: Here we override the Rails 6 implementation of this method with the
  # Rails 7 implementation in order to be able to retry/discard exceptions that
  # don't inherit from StandardError, such as `Resque::TermException`.
  #
  # When we upgrade to Rails 7 we should remove this.
  # Latest stable Rails (7 as of this writing) source: https://github.com/rails/rails/blob/main/activejob/lib/active_job/execution.rb
  # Rails 6.1 source: https://github.com/rails/rails/blob/6-1-stable/activejob/lib/active_job/execution.rb
  # Rails 6.0 source (same code as 6.1): https://github.com/rails/rails/blob/6-0-stable/activejob/lib/active_job/execution.rb
  #
  # NOTE: I've made a minor change to the Rails 7 implementation, I've removed
  # the line `ActiveSupport::ExecutionContext[:job] = self`, because `ExecutionContext`
  # isn't defined prior to Rails 7.
  def perform_now
    # Guard against jobs that were persisted before we started counting executions by zeroing out nil counters
    self.executions = (executions || 0) + 1

    deserialize_arguments_if_needed

    run_callbacks :perform do
      perform(*arguments)
    end
  rescue Exception => exception
    rescue_with_handler(exception) || raise
  end

DirtyExit

Resque::DirtyExit is raised in the parent process, rather than the forked child process that actually executes your job code. This means that any code you have in your job for rescuing or retrying those exceptions won't work. See these lines of code where that happens:

  1. https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L940
  2. https://github.com/resque/resque/blob/master/lib/resque/job.rb#L234
  3. https://github.com/resque/resque/blob/master/lib/resque/job.rb#L285

Fortunately, Resque provides a mechanism for dealing with this: job hooks, specifically the on_failure hook (https://github.com/resque/resque/blob/master/docs/HOOKS.md#job-hooks).

A quote from those docs:

on_failure: Called with the exception and job args if any exception occurs while performing the job (or hooks), this includes Resque::DirtyExit.

And an example from those docs on how to use hooks to retry exceptions:

module RetriedJob
  def on_failure_retry(e, *args)
    Logger.info "Performing #{self} caused an exception (#{e}). Retrying..."
    Resque.enqueue self, *args
  end
end

class MyJob
  extend RetriedJob
end

Are your resque jobs taking longer than 10 seconds to complete? If the jobs complete within 10 seconds after the initial SIGTERM is sent, you should be fine. Try to break up the jobs into smaller chunks that finish more quickly.

Also, you can have your worker re-enqueue the job by doing something like this: https://gist.github.com/mrrooijen/3719427
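
For what it's worth, the general shape of that approach (not the linked gist verbatim, just the pattern of rescuing the TERM signal and putting the job back on the queue; the job class here is made up) is roughly:

class ExportJob
  @queue = :high

  def self.perform(*args)
    # ... the actual work ...
  rescue Resque::TermException
    # The worker received SIGTERM mid-run (deploy/restart); put the job back
    # on the queue so a fresh worker picks it up once the new processes are up.
    Resque.enqueue(self, *args)
  end
end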

Michael van Rooijen
jfeust
  • Upvoted and accepted - I'm honestly not sure if we can get them all under 10 seconds though. We have some large exports etc which need to generate one file. Re-enqueueing seems like it solves this though? Can you share what the difference is between `Resque::TermException` and `Resque::DirtyExit`? I have a rescue in there for `Resque::DirtyExit` but it doesn't seem to always re-enqueue. Thanks! – Brian Armstrong May 07 '13 at 05:06
  • As an update, they strangely do not rescue those exceptions cleanly sometimes despite having `rescue Resque::DirtyExit` in the job. I haven't been able to figure out why. This is making our jobs unreliable, as we still find them in the failed queue with Resque::DirtyExit exceptions. It's really becoming a problem. – Brian Armstrong May 20 '13 at 22:29
  • Can someone recommend how the worker should handle the SIGTERM inside the worker so the worker can shut itself down cleanly? For example, should the (resque) worker also trap SIGTERM and set some variable that the looping code periodically inspects? I'm assuming that the TermException or DirtyExit is raised only after RESQUE_TERM_TIMEOUT seconds. – Mike P. Mar 16 '15 at 18:32
  1. We could manually rescue from Resque::TermException in each job, and use this to reschedule the job. But is there a clean way to do this for all jobs? Even a monkey patch.

The Resque::DirtyExit exception is raised when the job is killed with the SIGTERM signal. The job does not have the opportunity to catch the exception as you can read here.

  2. Shouldn't resque-retry auto retry these? Can you think of any reason why it wouldn't be?

I don't see why it shouldn't. Is the scheduler running? If not, start it with rake resque:scheduler.
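
For comparison, a minimal resque-retry setup looks roughly like the sketch below (the job class is made up; @retry_limit and @retry_delay are the plugin's documented class-level options, and the delayed retries only fire if resque-scheduler is running):

require 'resque-retry'

class ExportJob
  extend Resque::Plugins::Retry
  @queue = :high

  @retry_limit = 3   # give up after three retries
  @retry_delay = 60  # wait 60 seconds between attempts

  def self.perform(*args)
    # ... the actual work ...
  end
end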

I wrote a detailed blog post around some of the problems I had recently with Resque::DirtyExit, maybe it is useful => Understanding the Resque internals – Resque::DirtyExit unveiled

mottalrd
  • You mentioned `SIGTERM` but linked to `SIGKILL`. The link specifically says that `SIGTERM` *can* be intercepted. – sixty4bit May 08 '18 at 00:06

I've also struggled with this for a while without finding a reliable solution.

One of the few solutions I've found is running a rake task on a schedule (a cron job every minute) that looks for jobs that failed with Resque::DirtyExit, retries those specific jobs, and removes them from the failure queue.

Here's a sample of the rake task https://gist.github.com/CharlesP/1818418754aec03403b3
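
The rough shape of such a task (a sketch rather than the gist verbatim; it walks Resque's failure backend, re-enqueues the DirtyExit failures, and removes them, and the exact Resque::Failure calls may vary slightly between Resque versions) is:

namespace :resque do
  desc "Requeue and clear jobs that failed with Resque::DirtyExit"
  task requeue_dirty_exits: :environment do
    # Walk the failure list backwards so removing an entry doesn't shift
    # the indexes of entries we haven't inspected yet.
    (Resque::Failure.count - 1).downto(0) do |i|
      failure = Resque::Failure.all(i, 1)
      next unless failure && failure['exception'] == 'Resque::DirtyExit'

      Resque::Failure.requeue(i)
      Resque::Failure.remove(i)
    end
  end
end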

This solution is clearly suboptimal but to date it's the best solution I've found to retry these jobs.

Charles