I have a workload of 16 instances that can communicate with each other (verified by ping). Each of them runs a long-running task and was started like this:
nohup celery worker -A tasks.workers --loglevel=INFO --logfile=/dockerdata/log/celery.log --concurrency=7 >/dev/null 2>&1 &
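For reference, tasks.workers is an ordinary Celery application; a simplified sketch of it is below (the broker URL and the task body are placeholders, not the real values):

# tasks/workers.py (simplified sketch; broker URL and task are placeholders)
from celery import Celery

app = Celery("workers", broker="redis://broker-host:6379/0")  # placeholder broker

@app.task
def long_running_job(item_id):
    # stands in for the actual long-running work each instance processes
    ...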
However, after a while a few of the Celery instances always stop running. Since the log directory normally receives a new log every day, I checked the last day's logs on the affected instances and found the following information:
Worker exited by signal SIGKILL:
[2021-07-23 09:04:24,270: ERROR/MainProcess] Process 'ForkPoolWorker-19773' pid:2846586 exited with 'signal 9 (SIGKILL)'
[2021-07-23 09:04:24,281: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 79074.')
Traceback (most recent call last):
  File "/data/anaconda3/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 79074.
Missed heartbeat from...
[2021-07-30 10:24:26,815: INFO/MainProcess] missed heartbeat from celery@instance-1
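For reference, only running workers answer Celery's built-in inspect ping, so a command like the following (pointed at the same app as the start command above) shows which instances are still alive; stopped workers simply do not reply:

celery -A tasks.workers inspect ping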
I suspect that the Celery workers stopping has something to do with the two messages above. Can anyone suggest solutions to this problem?