
I have the following code:

class StatusTask(automata_celery.Task):

  def on_success(self, retval, task_id, args, kwargs):
    # Called by Celery after the task returns successfully
    with app.app_context():
      cloaker = Cloaker.query.get(args[0])
      cloaker.status = RemoteStatus.LAUNCHED
      db.session.commit()

  def on_failure(self, exc, task_id, args, kwargs, einfo):
    # Called by Celery when the task raises an exception
    with app.app_context():
      cloaker = Cloaker.query.get(args[0])
      cloaker.status = RemoteStatus.ERROR
      db.session.commit()


@celery.task(base=StatusTask)
def deploy_cloaker(cloaker_id):
  """To prevent launching while we are launching, we will
  disable launching until the cloaker's status is LAUNCHED
  """
  cloaker = Cloaker.query.get(cloaker_id)
  if not cloaker.can_launch():
    return

  cloaker.status = RemoteStatus.LAUNCHING
  db.session.commit()

  host = cloaker.server.ssh_user + '@' + cloaker.server.ip
  execute(fabric_deploy_cloaker, cloaker, hosts=host)


def fabric_deploy_cloaker(cloaker):
  domain = cloaker.domain
  sudo('rm -rf /var/www/%s/html' % domain)          # Restartable process
  sudo('mkdir -p /var/www/%s/html' % domain)

When I supply a faulty IP address for Fabric to SSH to (1.2.3.45), the Celery worker exits prematurely and never executes the on_failure handler.

Here is the log from my Celery worker window:

[2017-07-31 01:04:45,231: WARNING/PoolWorker-8] [root@1.2.3.45] Executing task 'fabric_deploy_cloaker'
[2017-07-31 01:04:45,231: WARNING/PoolWorker-8] [root@1.2.3.45] sudo: rm -rf /var/www/google.com/html
[2017-07-31 01:04:55,328: WARNING/PoolWorker-8] Fatal error: Timed out trying to connect to 1.2.3.45 (tried 1 time)

Underlying exception:
    timed out
[2017-07-31 01:04:55,328: WARNING/PoolWorker-8] Aborting.
[2017-07-31 01:04:59,126: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
  File "/Users/vng/.virtualenvs/AutomataHeroku/lib/python2.7/site-packages/billiard/pool.py", line 1224, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: exitcode 0.

However, when I inspect the state of this task, I see the following: state=FAILURE status=FAILURE message=Worker exited prematurely: exitcode 0.

How can I handle this error gracefully?

My application needs to set cloaker.status to either LAUNCHED or ERROR so that my end-users can relaunch this task manually.
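One plausible cause, worth noting: Fabric 1.x aborts by calling sys.exit(1), which raises SystemExit. SystemExit is not a subclass of Exception, so it escapes an ordinary `except Exception` handler and kills the pool worker before on_failure can run. A minimal, Fabric-free sketch of the guard pattern (flaky_deploy and deploy_with_guard are hypothetical stand-ins, not part of the code above):

```python
def flaky_deploy():
    # Stand-in for fabric.api.execute(...) aborting on an unreachable host:
    # Fabric's abort() ultimately calls sys.exit(1), i.e. raises SystemExit.
    raise SystemExit(1)


def deploy_with_guard():
    try:
        flaky_deploy()
    except SystemExit as exc:
        # Re-raise as a normal Exception so Celery's on_failure can fire
        raise RuntimeError("deploy aborted: %s" % exc)


try:
    deploy_with_guard()
except RuntimeError as err:
    outcome = str(err)  # "deploy aborted: 1"
```

If this is indeed the cause, Fabric 1.x also lets you set `fabric.api.env.abort_exception` to a custom exception class so aborts raise that instead of SystemExit.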

Tinker

1 Answer


I've faced the same problem in my project, and found two possible workarounds:

The first is to avoid duplicating (and having to synchronize!) Celery's own task state with your application state (RemoteStatus.LAUNCHED). You would store the AsyncResult returned by apply_async(), or at least the task's id.
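That first workaround might look like the sketch below: keep only the task id, and map Celery's state names onto what the UI needs. The CELERY_TO_UI mapping and ui_status helper are assumptions for illustration, not part of Celery; in real code the input would come from AsyncResult(stored_task_id).state.

```python
# Hypothetical mapping from Celery task states to the UI-facing status.
# 'FAILURE' covers WorkerLostError failures too, since Celery marks the
# task FAILURE even when the pool worker died (as seen in the question).
CELERY_TO_UI = {
    'SUCCESS': 'LAUNCHED',
    'FAILURE': 'ERROR',
    'PENDING': 'LAUNCHING',
    'STARTED': 'LAUNCHING',
}


def ui_status(celery_state):
    # e.g. celery_state = AsyncResult(stored_task_id).state
    return CELERY_TO_UI.get(celery_state, 'LAUNCHING')
```

This sidesteps the synchronization problem entirely: the question's log shows the task ends up state=FAILURE even though on_failure never ran, so Celery's state is the more reliable source of truth.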

The second is to wrap the actions that may lead to the WorkerLostError in try/except:

  host = cloaker.server.ssh_user + '@' + cloaker.server.ip
  try:
      # assert_execute is a precondition check you write yourself
      # (see the comments below); it should raise if the host is bad
      assert_execute(fabric_deploy_cloaker, cloaker, hosts=host)
  except Exception:
      raise FabricDeployError("Something went wrong")
  else:
      execute(fabric_deploy_cloaker, cloaker, hosts=host)
Sergey Belash
  • Hi, where do you get `assert_execute`? I can't find that method in Fabric at all. `execute` alone does not raise an exception for the worker timeout – Tinker Jul 31 '17 at 09:07
  • Yeah, there is no such method — I'm suggesting you write it yourself. You have to figure out why your worker gets killed (see https://stackoverflow.com/questions/22805079/celery-workerlosterror-worker-exited-prematurely-signal-9-sigkill?rq=1) and then write that kind of assertion for the bad case – Sergey Belash Jul 31 '17 at 09:15
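One way such a hand-written check could look, sketched with only the standard library: verify the host is reachable on the SSH port before handing it to Fabric, so an unreachable IP like 1.2.3.45 fails with an ordinary exception instead of killing the pool worker. FabricDeployError, assert_reachable, and the port/timeout defaults are assumptions for illustration.

```python
import socket


class FabricDeployError(Exception):
    """Raised when a deploy precondition fails."""


def assert_reachable(ip, port=22, timeout=5):
    # Try a plain TCP connection to the SSH port; raise a normal
    # exception (not SystemExit) if the host cannot be reached, so
    # Celery's on_failure handler gets a chance to run.
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
        sock.close()
    except (socket.error, socket.timeout) as exc:
        raise FabricDeployError(
            "cannot reach %s:%s (%s)" % (ip, port, exc))
```

Inside deploy_cloaker you would call assert_reachable(cloaker.server.ip) before execute(), and let the FabricDeployError propagate so on_failure sets RemoteStatus.ERROR.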