
I use Celery/RabbitMQ for asynchronous task execution with my Django application. I have just started working with Celery.

The tasks execute and everything works perfectly once I start the worker.

The problem is that task execution stops some time later: after a couple of hours, a day, or sometimes a couple of days. I realise this only from the consequences of the incomplete task executions. Then I restart Celery, all the pending tasks get executed, and everything is back to normal.

My questions are:

  • How can I debug (where to start looking) to find out what the problem is?
  • How can I create a mechanism that will notify me immediately after the problem starts?

My stack: Django 1.4.8, Celery 3.1.16, RabbitMQ, Supervisord.

Thanks, andy

  • Have you tried the RabbitMQ management plugin to see if there are any issues with the RabbitMQ queues at the time things get stuck? That way you'd be one step closer to knowing whether the issue is in RabbitMQ or in Celery. – Nikunj Jan 02 '15 at 06:04
  • @nIKUNJ will try to do that – andy Jan 02 '15 at 07:39
  • Does this answer your question? [Celery worker hangs without any error](https://stackoverflow.com/questions/30272845/celery-worker-hangs-without-any-error) – Nicolò Gasparini Dec 30 '20 at 18:55

1 Answer


(1) If your Celery worker gets stuck sometimes, you can use strace & lsof to find out which system call it is stuck on.

For example:

$ strace -p 10268 -s 10000
Process 10268 attached - interrupt to quit
recvfrom(5,

Here 10268 is the PID of the Celery worker, and `recvfrom(5` means the worker is blocked receiving data from file descriptor 5.

Then you can use lsof to find out what file descriptor 5 is in this worker process.

lsof -p 10268
COMMAND   PID USER   FD   TYPE    DEVICE SIZE/OFF      NODE NAME
......
celery  10268 root    5u  IPv4 828871825      0t0       TCP 172.16.201.40:36162->10.13.244.205:wap-wsp (ESTABLISHED)
......

This indicates that the worker is stuck on a TCP connection (you can see `5u` in the FD column).

Some Python packages like requests block while waiting for data from the peer, which can cause the Celery worker to hang. If you are using requests, make sure to set the timeout argument.
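For example, a minimal sketch of a task-side HTTP call with an explicit timeout (the URL and function name here are placeholders, not from the original question):

```python
import requests

# Illustrative sketch: an HTTP call as it might appear inside a Celery
# task body. The key point is the timeout argument: without it,
# requests can block forever on a dead peer, and the whole worker
# process hangs with it.
def fetch_report(url):
    try:
        # (connect timeout, read timeout) in seconds
        response = requests.get(url, timeout=(3.05, 27))
        return response.status_code
    except requests.exceptions.Timeout:
        # Fail fast instead of hanging; Celery can retry the task later.
        return None
```

With the timeout set, a stalled peer turns into a caught exception instead of a silently frozen worker.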

(2) You can monitor your Celery task queue size in RabbitMQ; if it keeps increasing over a long time, the Celery worker has probably gone on strike.
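As a sketch, the queue depth can be polled via the RabbitMQ management HTTP API (the management plugin must be enabled; the host, credentials, vhost, queue name, and threshold below are assumptions for a default setup):

```python
import requests

# Management API endpoint for the "celery" queue on the "/" vhost
# (%2F is the URL-encoded vhost name). Adjust host, credentials and
# queue name for your deployment.
MANAGEMENT_URL = "http://localhost:15672/api/queues/%2F/celery"

def queue_is_backed_up(depth, threshold=1000):
    """True when the number of waiting messages crosses the threshold."""
    return depth >= threshold

def check_queue():
    resp = requests.get(MANAGEMENT_URL, auth=("guest", "guest"), timeout=5)
    resp.raise_for_status()
    depth = resp.json()["messages"]
    if queue_is_backed_up(depth):
        # Hook your alerting (email, pager, etc.) in here.
        print("ALERT: %d messages waiting; workers may be stuck" % depth)
    return depth
```

Running something like this from cron every few minutes would give you the immediate notification asked about in the question.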


Have you seen this page:

https://www.caktusgroup.com/blog/2013/10/30/using-strace-debug-stuck-celery-tasks/

Gary Gauh
  • Mate, I owe you a beer for this. We've been stuck on this exact issue for 2 months, having to respawn all of our AWS servers every 4-5 days or so, because we weren't using `timeout` in requests. Thanks! – Jamie S Nov 02 '16 at 03:39