
I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks, and I daemonized it using Supervisor. The problem is that one of the periodic tasks I defined is failing (after a week in which it worked properly). The error I get is:

[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

All the processes managed by Supervisor are up and running properly (supervisorctl status says RUNNING).

I tried reading several logs on my EC2 instance, but none of them helped me find out what is causing the SIGKILL. What should I do? How can I investigate?

These are my celery settings:

CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False

And this is my supervisord.conf:

[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

Edit 1

After restarting celery beat the problem remains.

Edit 2

I changed killasgroup=true to killasgroup=false and the problem remains.

daveoncode
  • Hint: Most probably it's due to low memory/RAM on your server. If you're running containers through Docker, you can see the memory consumption of each container using `docker stats`. – Krishna Jul 29 '21 at 12:38

2 Answers

The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and killasgroup would only affect a supervisor-initiated kill (e.g. from supervisorctl or a plugin); without that setting, supervisor would have sent the signal to the dispatcher anyway, not to the child.

Most likely you have a memory leak and the OS's OOM killer is assassinating your process for bad behavior.

Run grep oom /var/log/messages. If you see matches, that's your problem.
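
The exact log file varies by distribution (for example, Ubuntu puts kernel messages in /var/log/kern.log or /var/log/syslog rather than /var/log/messages). As a minimal sketch, you could scan the usual locations for oom-killer entries from Python (reading these files typically requires root):

# Sketch only: scan common kernel log locations for oom-killer entries.
for path in ("/var/log/messages", "/var/log/kern.log", "/var/log/syslog"):
    try:
        with open(path) as log:
            hits = [line for line in log if "oom" in line.lower()]
    except IOError:
        continue
    if hits:
        print("%s:" % path)
        print("".join(hits[-5:]))  # the most recent matches are usually the relevant ones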

If you don't find anything, try running the periodic process manually in a shell:

MyPeriodicTask().run()

And see what happens. I'd monitor system and process metrics with top in another terminal if you don't have good instrumentation like Cacti, Ganglia, etc. for this host.
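
If you want a rough in-process view of memory growth while the task runs, here is a minimal sketch using the standard-library resource module (MyPeriodicTask is the task from above; note that ru_maxrss is reported in kilobytes on Linux):

import resource

def log_peak_memory(label):
    # Peak resident set size of the current process (kilobytes on Linux).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("%s: peak RSS %.1f MB" % (label, peak_kb / 1024.0))

log_peak_memory("before task")
MyPeriodicTask().run()
log_peak_memory("after task")

If the peak keeps climbing across repeated runs in the same shell, data is accumulating in memory.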

Nino Walker
    You are right "celery invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0" ...now I have to find why this happen, because previously it was running as expected :P Thank you very much! – daveoncode Apr 03 '14 at 19:25
  • 14
    @daveoncode I think Lewis Carol once wrote, ""Beware the oom-killer, my son! The jaws that bite, the claws that catch!" – Nino Walker Apr 03 '14 at 21:25
  • 26
    On my Ubuntu box the log to check is `/var/log/kern.log`, not `/var/log/messages` – Rick Mohr Dec 09 '14 at 21:55
  • 4
    in my Ubuntu box it is `/var/log/syslog` (so much for consistency) – dashesy May 25 '15 at 21:35
  • 3
    @daveoncode how did you go about finding why that happens. im also stuck at a similar position. and the problem is its happening for only one task and the according to compute engine it everything about cpu usage and memory seems fine – Buddhi741 Feb 13 '18 at 00:59
  • 3
    I figured out it was a memory issue. – varnothing Sep 09 '18 at 06:12
  • 14
    I was running celery workers on ecs with too little RAM per task and I also saw oom killing processes. So its not always related to memory leaks, but can also be the cause of not enough RAM. – J Selecta Aug 13 '19 at 15:46
  • @JSelecta Thanks. It really helped. The same issue was with my server. I was running multiple container on 1GB ram server. The celery worker needed 400MB for a specific task. When I upgraded to 2GB, it's working fine. – Krishna Jul 29 '21 at 12:15

You see this kind of error when an asynchronous task (run through Celery), or the script you are using, keeps a lot of data in memory and effectively leaks it.

In my case, I was fetching data from another system and saving it in a variable, so that I could export all the data (into a Django model / an Excel file) after the process finished.

Here is the catch: my script was gathering 10 million records, and it kept accumulating them in memory while it ran. This is what raised the exception.

To overcome the issue, I divided the 10 million records into 20 parts (half a million each). Every time the data reached 500,000 items, I stored it in my local file / Django model of choice and then cleared it from memory, repeating this for every batch of 500k items.

There is no need to use that exact number of partitions; the idea is to solve a complex problem by splitting it into multiple subproblems and solving them one by one. :D
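
A minimal sketch of that batching idea; fetch_records() and save_batch() are hypothetical placeholders for however you pull data from the other system and persist it (Django model, Excel file, ...):

# Sketch only: process records in fixed-size batches instead of holding all of them in memory.
BATCH_SIZE = 500000

batch = []
for record in fetch_records():      # ideally a generator that streams records one at a time
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        save_batch(batch)           # e.g. MyModel.objects.bulk_create(batch)
        batch = []                  # drop the reference so the memory can be reclaimed

if batch:                           # persist the final partial batch
    save_batch(batch)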

Farid Chowdhury