We are using RQ with our WSGI application. We have several processes on different back-end servers running the tasks, connecting to (possibly) several different task servers. To better configure this setup, we use a custom management layer in our system which takes care of running workers, setting up the task queues, etc.
When a job fails, we would like to retry it several times with an increasing delay, and eventually either have it complete or give up and log an error entry in our logging system. However, I am not sure how this should be implemented. I have already created a custom worker script which allows us to log errors to our database, and my first attempt at a retry handler was something along these lines:
import time

# This handler would ideally wait some time, then requeue the job.
def worker_retry_handler(job, exc_type, exc_value, tb):
    print('Doing retry handler.')
    # attr.retry is a key constant from our code base.
    current_retry = job.meta.get(attr.retry, 2)
    if current_retry >= 129600:  # stop retrying once the delay exceeds 36 hours
        log_error_message('Job catastrophic failure.', ...)
    else:
        current_retry *= 2  # double the delay before each retry
        log_retry_notification(current_retry)
        job.meta[attr.retry] = current_retry
        job.save()
        time.sleep(current_retry)  # this blocks the worker for the entire delay
        job.perform()  # re-runs the job inline, inside the handler
    return False  # stop any further exception handlers from running
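For reference, this is roughly how a handler like this gets attached to a worker. This is only a minimal sketch; the queue name and Redis connection below are placeholders rather than our real server-resolution logic:

from redis import Redis
from rq import Queue, Worker

redis_conn = Redis()  # placeholder; we actually resolve the server dynamically
queue = Queue('default', connection=redis_conn)  # 'default' is a placeholder name

worker = Worker([queue], connection=redis_conn)
# push_exc_handler registers a custom exception handler that RQ calls
# whenever a job raises an exception.
worker.push_exc_handler(worker_retry_handler)
worker.work()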
As I mentioned, we also have a function in the worker file which correctly resolves the server to connect to, and which can post jobs. So the problem is not really how to publish a job, but what to do with the job instance you get in the exception handler.
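My current guess is that instead of sleeping and calling job.perform() inside the handler, it should re-enqueue the job on its originating queue. A minimal sketch of what I have in mind (the 'retry' meta key and backoff numbers are just illustrations, and the delay part is still unsolved):

from rq import Queue

def requeue_with_backoff(job):
    # Double the stored delay before putting the job back on a queue.
    delay = job.meta.get('retry', 2) * 2
    job.meta['retry'] = delay
    job.save()
    # job.origin holds the name of the queue the job originally came from,
    # and job.connection is the Redis connection it was loaded with.
    origin_queue = Queue(job.origin, connection=job.connection)
    origin_queue.enqueue_job(job)
    # This re-enqueues immediately; actually honouring the delay would need
    # something like rq-scheduler or another delayed-requeue mechanism.

Is that the right thing to do with the job instance, or is there a cleaner approach?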
Any help would be greatly appreciated. Suggestions or pointers on better ways to do this would also be great. Thanks!