
I run long, idempotent tasks on AWS spot instances, but I can't work out how to set up Celery to gracefully handle workers being killed mid-task.

At the moment, if a worker is killed, the task is marked as failed (WorkerLostError). I found the documentation on the subject a bit thin, but it suggests using CELERY_ACKS_LATE for this scenario. That isn't working for me: the task is still marked as failed.

When I had CELERY_ACKS_LATE=False, the task just stayed stuck as PENDING, so at least now I can tell that it has failed, which is a good start.

Here are my config settings at the moment:

# I'm using RabbitMQ as the broker
BROKER_HEARTBEAT = 10
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
CELERY_TRACK_STARTED = True

I have a task spinning on a master server that checks for the results of outstanding tasks and handles updating my local db to mark the tasks as complete (and performs work with the results). At this stage I think I'm going to have to catch the 'Worker exited prematurely: signal 15 (SIGTERM)' scenario and retry the task.
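
To make that workaround concrete, here is a minimal sketch of the check I have in mind on the master. It is duck-typed: it only reads the `state` and `result` attributes that Celery's AsyncResult provides. `FakeResult`, the local `WorkerLostError` class, and the `resubmit` callback are hypothetical stand-ins so the snippet runs without a broker; matching on the exception's class name is also an assumption about how the error comes back from the result backend.

```python
from collections import namedtuple

# Hypothetical stand-in for celery.result.AsyncResult; only the `state`
# and `result` attributes are used, both of which AsyncResult provides.
FakeResult = namedtuple("FakeResult", ["state", "result"])


class WorkerLostError(Exception):
    """Local stand-in for the WorkerLostError raised when a worker dies."""


def worker_was_lost(async_result):
    """Return True if the task failed because its worker process died."""
    if async_result.state != "FAILURE":
        return False
    # On failure, `result` holds the exception instance for the task.
    # Assumption: the class name survives the trip through the backend.
    error = async_result.result
    return type(error).__name__ == "WorkerLostError"


def resubmit_lost_tasks(outstanding, resubmit):
    """Call `resubmit(task_id)` for each task whose worker was lost.

    `outstanding` maps task id -> result object. `resubmit` is a
    hypothetical callback, e.g. one that calls `apply_async` again.
    Returns the list of task ids that were resubmitted.
    """
    retried = []
    for task_id, async_result in outstanding.items():
        if worker_was_lost(async_result):
            resubmit(task_id)
            retried.append(task_id)
    return retried
```

Tasks that failed for any other reason are left alone, so a genuinely broken task is not retried in a loop; only the "worker died" case is resubmitted.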

It feels like this should all be handled by Celery, so I suspect I've missed something fundamental in my config.

Given idempotent tasks and workers that will fail, what is the best way to configure Celery so that those tasks are picked up by a different worker?

Aidan Kane
  • Redelivering a task that crashed the process is a bad idea since that will most likely cause the crash to happen again (and in a loop). – asksol Jan 23 '14 at 16:05
  • That's not the situation though. I want to know what to do when it's not the task at fault - but the server. I have servers that can be killed at any moment. The worker fails, not the task. Also, my tasks can be rerun if they're killed. – Aidan Kane Jan 23 '14 at 23:27
  • In that case you could submit a patch that lets you change that behavior. It could also enable the ability to reject such a message so that the dead-letter facilities in RabbitMQ can be used. – asksol Jan 27 '14 at 14:50

0 Answers