I'm using an RQ worker to handle a large number of jobs, and I'm running into problems.
Observations
- Jobs fail with the error `work-horse terminated unexpectedly; waitpid returned None`.
- The job connects to a database and runs a handful of SQL statements, e.g. a simple insert or delete.
- The error appears almost immediately, within a few seconds of the job starting.
- Sometimes the jobs run fine and there is no issue.
- On one of the jobs, I can see that its insert completes, yet the job still fails with the error.
- On the rq worker, I see the following log entries:

  ```
  {"message": "my_queue: my_job() (dcf797c4-1434-4b77-a344-5bbb1f775113)"}
  {"message": "Killed horse pid 8451"}
  {"message": "Moving job to FailedJobRegistry (work-horse terminated unexpectedly; waitpid returned None)"}
  ```
- Digging into the rq code (https://github.com/rq/rq), the "Killed horse pid..." line is a tip-off that RQ is deliberately killing the job itself. The only place that kill happens is in the snippet below. To reach the `self.kill_horse()` line, a `HorseMonitorTimeoutException` has to be raised and the `utcnow() - job.started_at` difference has to exceed `job.timeout` (and the timeout is huge, by the way).
  ```python
  while True:
      try:
          with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
              retpid, ret_val = os.waitpid(self._horse_pid, 0)
          break
      except HorseMonitorTimeoutException:
          # Horse has not exited yet and is still running.
          # Send a heartbeat to keep the worker alive.
          self.heartbeat(self.job_monitoring_interval + 5)

          # Kill the job from this side if something is really wrong (interpreter lock/etc).
          if job.timeout != -1 and (utcnow() - job.started_at).total_seconds() > (job.timeout + 1):
              self.kill_horse()
              break
  ```
- Sometimes the jobs sit in the queue for a long time before the worker actually gets to them. I would have expected `started_at` to be set only when the worker picks the job up, though that assumption may be wrong (the diagnostic sketch after this list checks it).
- The jobs are created with rq_scheduler and fire on a regular cron schedule (e.g. 11pm every day).
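For context, the scheduling code is along these lines. This is a simplified sketch: the `my_job` import path and the exact timeout value are placeholders, not my real values.

```python
from redis import Redis
from rq_scheduler import Scheduler

from myapp.tasks import my_job  # hypothetical module path

scheduler = Scheduler(queue_name='my_queue', connection=Redis())

# Fire my_job at 11pm every day on my_queue. The job-level timeout is
# deliberately far above the few seconds the SQL statements need.
scheduler.cron(
    '0 23 * * *',          # cron string
    func=my_job,
    queue_name='my_queue',
    timeout=6 * 60 * 60,   # placeholder; the real timeout is "huge"
)
```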
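To test the timeout theory, this is the diagnostic I'm planning to run next (a minimal sketch, assuming a local Redis and the `my_queue` queue from the logs):

```python
from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

redis = Redis()
queue = Queue('my_queue', connection=redis)
registry = FailedJobRegistry(queue=queue)

# Compare the timestamps RQ recorded against job.timeout to see whether
# the (utcnow() - job.started_at) > job.timeout branch could actually fire.
for job_id in registry.get_job_ids():
    job = Job.fetch(job_id, connection=redis)
    print(
        job_id,
        'enqueued_at:', job.enqueued_at,
        'started_at:', job.started_at,
        'ended_at:', job.ended_at,
        'timeout:', job.timeout,
    )
```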
What should my next steps be for this?