4

Is it possible to send a heartbeat to Hangfire (Redis storage) to tell the system that the process is still alive? At the moment I set the InvisibilityTimeout to TimeSpan.MaxValue to prevent Hangfire from restarting the job. But if the process fails or the server restarts, the job will never be removed from the list of running jobs. So my idea was to remove the large timeout and send a kind of heartbeat instead. Is this possible?
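For context, the workaround mentioned above looks roughly like this. This is a minimal sketch, assuming the Hangfire.Pro.Redis storage package (the community Hangfire.Redis.StackExchange package exposes a similar RedisStorageOptions); the connection string is a placeholder.

```csharp
using System;
using Hangfire;
using Hangfire.Pro.Redis; // or Hangfire.Redis for the community package

public static class HangfireConfig
{
    public static void Configure()
    {
        GlobalConfiguration.Configuration.UseRedisStorage(
            "localhost:6379", // placeholder connection string
            new RedisStorageOptions
            {
                // Effectively disables dead-job detection: Hangfire will never
                // consider the job abandoned, but crashed jobs are never
                // cleaned up either, which is exactly the problem described.
                InvisibilityTimeout = TimeSpan.MaxValue
            });
    }
}
```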

BendEg
  • When the server restarts, are you having multiple instances running? – jayasurya_j Apr 19 '20 at 14:41
  • @jayasurya_j What do you mean by that? I think not, at the moment, because Hangfire does not recognize that the job died. – BendEg Apr 19 '20 at 14:46
  • I wanted to write a job that listens to a queue forever (runs in an infinite loop). I was thinking of setting the timeout to MaxValue, but since I use BackgroundJob.Enqueue() in Startup.cs, I think every time the server starts or a new deployment happens we enqueue another forever-running job. So I'm not sure how to implement a forever-running job in Hangfire. Any idea? – jayasurya_j Apr 19 '20 at 14:52
  • 1
    @jayasurya_j having the same problem, so not at the moment :) There is the possibility to use background jobs for long-running processes. But they are not scheduled. And not shown in the dashboard. Furthermore, starting from the dashboard and after server start is also not possible. – BendEg Apr 19 '20 at 14:54
  • I don't mind scheduling it every minute either, but the problem is I am using PostgreSQL (no options to configure expiration) and successful jobs are not automatically deleted. The data grows huge in a couple of days. Again, any idea? :P – jayasurya_j Apr 19 '20 at 15:09
  • @jayasurya_j personally I would write a job that cleans up the database. Maybe this would be the cleanest solution. – BendEg Apr 19 '20 at 15:33
  • For example, this job could run every hour or every day, ... – BendEg Apr 19 '20 at 15:33
  • 1
    This is a long open bug on Hangfire. https://github.com/HangfireIO/Hangfire/issues/1197 so not sure that there actually is a good solution atm. Me, I would rewrite the job not be an eternal loop and just schedule itself again once finished. – fredrik Apr 25 '20 at 15:17
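A minimal sketch of the self-rescheduling pattern suggested in the last comment, assuming a plain Hangfire setup; QueueListener and ProcessNextBatch are made-up names, and the one-minute delay is arbitrary:

```csharp
using System;
using Hangfire;

public class QueueListener
{
    public void ProcessNextBatch()
    {
        // ... read and handle a bounded number of messages here ...

        // Instead of looping forever, schedule the next run, so each
        // execution stays well below the invisibility timeout.
        BackgroundJob.Schedule<QueueListener>(
            listener => listener.ProcessNextBatch(),
            TimeSpan.FromMinutes(1));
    }
}
```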

1 Answer

2

I found https://discuss.hangfire.io/t/hangfire-long-job-stop-and-restart-several-time/4282/2, which deals with how to keep a long-running job alive in Hangfire. The user zLanger says there that jobs are considered dead and restarted once you ...

[...] are hitting hangfire’s invisibilityTimeout. You have two options.

  • increase the timeout to more than the job will ever take to run
  • have the job send a heartbeat to let hangfire’s know it’s still alive.

That's not new to you. But interestingly, the follow-up question there is:

How do you implement heartbeat on job?

This remains unanswered there, a hint that your problem really is not trivial.

I have never handled long-running jobs in Hangfire, but I know the problem from other queuing systems, such as the former Sun Grid Engine, which is how I got interested in your question.

Back in the day, I had exactly your problem with Sun Grid Engine, and the department's computer guru told me that, according to some mathematical queuing theory, one should avoid long-running jobs at any cost (I will try to contact him and find the reference to the book he quoted). His idea may be worth sharing with you:

If you have a job which takes longer than the maximum running time tolerated by the queuing system, do not submit the job itself, but rather multiple calls of a wrapper script which is able to (1) start, (2) freeze-stop, and (3) unfreeze-continue the actual task.

This stop-and-continue can indeed be a suspend at the operating-system level (CTRL+Z to stop and fg to continue in Linux); see e.g. unix.stackexchange.com on that issue.

In practice, I had the binary myMonteCarloExperiment.x and the wrapper script myMCjobStarter.sh. The maximum compute time I had was a day. I would fill the queue with hundreds of calls of the wrapper script, with the boundary condition that only one of them should be running at a time. The script would check whether there was already a myMonteCarloExperiment.x process started anywhere on the compute cluster; if not, it would start an instance. In case there was a suspended process, the wrapper script would resume it, let it run for 23 hours and 55 minutes, and then suspend the process again. In any other case, the wrapper script would report an error.
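The original wrapper was a shell script on a compute cluster. Purely as an illustration, a heavily simplified, single-machine sketch of the same idea in C# could look like the following; the binary name is taken from above, the suspend/resume via kill assumes a Linux host, and the time budget mirrors the 23h55m limit mentioned above.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class JobWrapper
{
    public static void Main()
    {
        // Look for an already started (possibly suspended) instance of the task.
        var task = Process.GetProcessesByName("myMonteCarloExperiment.x").FirstOrDefault();

        if (task == null)
        {
            // (1) No instance yet: start the actual long-running task.
            task = Process.Start("./myMonteCarloExperiment.x");
        }
        else
        {
            // (3) A suspended instance exists: unfreeze it (SIGCONT).
            Process.Start("kill", $"-CONT {task.Id}").WaitForExit();
        }

        // Let the task run for just under the queue's maximum allowed time.
        var budget = TimeSpan.FromHours(23) + TimeSpan.FromMinutes(55);

        if (!task.WaitForExit((int)budget.TotalMilliseconds))
        {
            // (2) Time is up but the task has not finished: freeze it (SIGSTOP)
            //     so the next wrapper invocation can resume it.
            Process.Start("kill", $"-STOP {task.Id}").WaitForExit();
        }
    }
}
```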

This approach does not implement a job heartbeat, but it does let a lengthy job run to completion. It also keeps the queue administrator happy by avoiding the need to clean up Hangfire's job logs.

Further references

B--rian