3

I'm using a RQ worker to handle a large number of jobs, and I'm having problems.

Observations

  • Job returns work-horse terminated unexpectedly; waitpid returned None
  • The job connects to a database and simply runs several SQL statement like a simple insert or delete statement.
  • The error message happens almost immediately: within several seconds of starting.
  • Sometimes the jobs run fine and there is no issue.
  • On one of the jobs, I can see it does an insert but then just returns the error.
  • On the rq worker, I see the following log entries.
{"message": "my_queue: my_job() (dcf797c4-1434-4b77-a344-5bbb1f775113)"}
{"message": "Killed horse pid 8451"}
{"message": "Moving job to FailedJobRegistry (work-horse terminated unexpectedly; waitpid returned None)"}
  • Digging into the rq code (https://github.com/rq/rq), the "Killed horse pid..." line is a tip off that RQ is deliberately killing the job itself. The only place that the job killing code happens is in the following snippet. To reach the self.kill_horse() line, a HorseMonitorTimeoutException has to happen and the utcnow - job.started_at difference has to be > job.timeout (timeout is huge btw).
        while True:
            try:
                with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
                    retpid, ret_val = os.waitpid(self._horse_pid, 0)
                break
            except HorseMonitorTimeoutException:
                # Horse has not exited yet and is still running.
                # Send a heartbeat to keep the worker alive.
                self.heartbeat(self.job_monitoring_interval + 5)

                # Kill the job from this side if something is really wrong (interpreter lock/etc).
                if job.timeout != -1 and (utcnow() - job.started_at).total_seconds() > (job.timeout + 1):
                    self.kill_horse()
                    break
  • Sometimes the jobs hang around in the queue for a long time before the worker actually gets to them. I would've expected started_at to be reset though. This assumption may be wrong.
  • The jobs are created using rq_scheduler and they fire regularly using a cron string (11pm every day, etc.)

What should my next steps be for this?

mj_
  • 6,297
  • 7
  • 40
  • 80

1 Answers1

1

I think that the latest version of RQ (https://github.com/rq/rq/releases/tag/v1.4.0) has the solution.

Fixed a bug that may cause early termination of scheduled or requeued jobs. Thanks @rmartin48!

mj_
  • 6,297
  • 7
  • 40
  • 80