I'm using an RQ worker to handle a large number of jobs, and I'm running into problems.
Observations
- Jobs fail with the error `work-horse terminated unexpectedly; waitpid returned None`.
- The job connects to a database and runs a handful of SQL statements, e.g. a simple insert or delete.
- The error appears almost immediately, within a few seconds of the job starting.
- Sometimes the jobs run fine and there is no issue.
- On one of the jobs, I can see that its insert completes, yet the job still fails with the error.
- On the rq worker, I see the following log entries:

  ```
  {"message": "my_queue: my_job() (dcf797c4-1434-4b77-a344-5bbb1f775113)"}
  {"message": "Killed horse pid 8451"}
  {"message": "Moving job to FailedJobRegistry (work-horse terminated unexpectedly; waitpid returned None)"}
  ```
- Digging into the rq code (https://github.com/rq/rq), the "Killed horse pid..." line is a tip-off that RQ is deliberately killing the job itself. The only place that kill happens is in the snippet below. To reach the `self.kill_horse()` line, a `HorseMonitorTimeoutException` has to be raised and the `utcnow() - job.started_at` difference has to exceed `job.timeout` (and the timeout is huge, by the way).
  ```python
  while True:
      try:
          with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
              retpid, ret_val = os.waitpid(self._horse_pid, 0)
          break
      except HorseMonitorTimeoutException:
          # Horse has not exited yet and is still running.
          # Send a heartbeat to keep the worker alive.
          self.heartbeat(self.job_monitoring_interval + 5)

          # Kill the job from this side if something is really wrong (interpreter lock/etc).
          if job.timeout != -1 and (utcnow() - job.started_at).total_seconds() > (job.timeout + 1):
              self.kill_horse()
              break
  ```
- Sometimes the jobs sit in the queue for a long time before the worker actually gets to them. I would have expected `started_at` to be set only when the worker picks the job up, though that assumption may be wrong (the diagnostic sketch after this list checks it).
- The jobs are created with rq_scheduler and fire on a regular cron schedule (e.g. 11pm every day).
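For context, the scheduling code is along these lines. This is a simplified sketch: the `my_job` import path and the exact timeout value are placeholders, not my real values.

```python
from redis import Redis
from rq_scheduler import Scheduler

from myapp.tasks import my_job  # hypothetical module path

scheduler = Scheduler(queue_name='my_queue', connection=Redis())

# Fire my_job at 11pm every day on my_queue. The job-level timeout is
# deliberately far above the few seconds the SQL statements need.
scheduler.cron(
    '0 23 * * *',          # cron string
    func=my_job,
    queue_name='my_queue',
    timeout=6 * 60 * 60,   # placeholder; the real timeout is "huge"
)
```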
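To test the timeout theory, this is the diagnostic I'm planning to run next (a minimal sketch, assuming a local Redis and the `my_queue` queue from the logs):

```python
from redis import Redis
from rq import Queue
from rq.job import Job
from rq.registry import FailedJobRegistry

redis = Redis()
queue = Queue('my_queue', connection=redis)
registry = FailedJobRegistry(queue=queue)

# Compare the timestamps RQ recorded against job.timeout to see whether
# the (utcnow() - job.started_at) > job.timeout branch could actually fire.
for job_id in registry.get_job_ids():
    job = Job.fetch(job_id, connection=redis)
    print(
        job_id,
        'enqueued_at:', job.enqueued_at,
        'started_at:', job.started_at,
        'ended_at:', job.ended_at,
        'timeout:', job.timeout,
    )
```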
What should my next steps be for this?