We are using RQ with our WSGI application. We have several processes on different back-end servers running the tasks, connecting to (possibly) several different task servers. To better configure this setup, we use a custom management layer in our system which takes care of running workers, setting up the task queues, etc.
When a job fails, we would like to retry it several times with an increasing delay, and eventually either have it complete or give up and log an error entry in our logging system. However, I am not sure how this should be implemented. I have already created a custom worker script which allows us to log errors to our database, and my first attempt at a retry handler was something along these lines:
import time

# This handler would ideally wait some time, then requeue the job.
def worker_retry_handler(job, exc_type, exc_value, tb):
    print('Doing retry handler.')
    # attr.retry is a key constant from our code base.
    current_retry = job.meta.get(attr.retry, 2)
    if current_retry >= 129600:  # stop retrying once the delay exceeds 36 hours
        log_error_message('Job catastrophic failure.', ...)
    else:
        current_retry *= 2  # double the delay before each retry
        log_retry_notification(current_retry)
        job.meta[attr.retry] = current_retry
        job.save()
        time.sleep(current_retry)  # this blocks the worker for the entire delay
        job.perform()  # re-runs the job inline, inside the handler
    return False  # stop any further exception handlers from running
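For reference, this is roughly how a handler like this gets attached to a worker. This is only a minimal sketch; the queue name and Redis connection below are placeholders rather than our real server-resolution logic:

from redis import Redis
from rq import Queue, Worker

redis_conn = Redis()  # placeholder; we actually resolve the server dynamically
queue = Queue('default', connection=redis_conn)  # 'default' is a placeholder name

worker = Worker([queue], connection=redis_conn)
# push_exc_handler registers a custom exception handler that RQ calls
# whenever a job raises an exception.
worker.push_exc_handler(worker_retry_handler)
worker.work()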
As I mentioned, we also have a function in the worker file which correctly resolves the server to connect to, and which can post jobs. So the problem is not really how to publish a job, but what to do with the job instance you get in the exception handler.
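My current guess is that instead of sleeping and calling job.perform() inside the handler, it should re-enqueue the job on its originating queue. A minimal sketch of what I have in mind (the 'retry' meta key and backoff numbers are just illustrations, and the delay part is still unsolved):

from rq import Queue

def requeue_with_backoff(job):
    # Double the stored delay before putting the job back on a queue.
    delay = job.meta.get('retry', 2) * 2
    job.meta['retry'] = delay
    job.save()
    # job.origin holds the name of the queue the job originally came from,
    # and job.connection is the Redis connection it was loaded with.
    origin_queue = Queue(job.origin, connection=job.connection)
    origin_queue.enqueue_job(job)
    # This re-enqueues immediately; actually honouring the delay would need
    # something like rq-scheduler or another delayed-requeue mechanism.

Is that the right thing to do with the job instance, or is there a cleaner approach?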
Any help would be greatly appreciated. Suggestions or pointers on better ways to do this would also be great. Thanks!