
I have a Sidekiq job that runs every 4 minutes.

This job checks whether the code block is already being executed before running it again:

process = ProcessTime.where("name = 'ad_queue_process'").first

# Return if job is running
return if process.is_running == true

If Sidekiq restarts midway through the code block, the code that updates the status of the job never runs:

# Done running, update the process times and allow it to be ran again
process.update_attributes(is_running: false, last_execution_time: Time.now)

This leads to the job never running again unless I manually run an update statement to set `is_running = false`.

Is there any way to execute code before Sidekiq is restarted?
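Putting the snippets together, the job looks roughly like this (the class name and the point where `is_running` is flipped to true are assumed here, since they are not shown in the snippets):

class AdQueueProcessWorker
  include Sidekiq::Worker

  def perform
    process = ProcessTime.where("name = 'ad_queue_process'").first

    # Return if job is running
    return if process.is_running == true

    # Mark the job as running (assumed; not shown in the snippets above)
    process.update_attributes(is_running: true)

    # ... the actual work happens here ...

    # Done running, update the process times and allow it to be run again
    process.update_attributes(is_running: false, last_execution_time: Time.now)
  end
end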

ricks
  • How are you restarting Sidekiq? It's normally stopped/restarted "gracefully", meaning it will finish all running jobs first before restarting/stopping: [reference](https://stackoverflow.com/questions/24446157/rails-how-to-restart-sidekiq/24446989) But if you CTRL+C in development, yes that will close immediately, but this is in development. If you want to make sure that the "update" will only commit in the DB if there are no errors (i.e. only if Sidekiq is not CTRL+C-ed in dev env in the middle of the job), then you can wrap the whole job in a `ProcessTime.transaction do ... end` block (sketched below, after these comments) – Jay-Ar Polidario Feb 05 '19 at 16:46
  • _Sidenote:_ this is in any case the wrong approach, vulnerable to race conditions. One should use a messaging queue, acking the messages _after_ the job finishes. Another (low-level) option would be a `Mutex`/`ConditionVariable`. All other solutions will sooner or later lead to the race condition and concurrent execution of two jobs simultaneously. – Aleksei Matiushkin Feb 05 '19 at 18:00
  • @Jay-ArPolidario My app is hosted on Heroku, so if I push to production or the daily server restart happens, Sidekiq restarts. Today I pushed to production while it was midway through the job, so it never reached the update statement – ricks Feb 05 '19 at 20:01
  • @AlekseiMatiushkin Hey, thanks for the advice, could you elaborate a bit on why this is not a good approach? Currently we store what the job requires in a table and the job only runs if it is currently not running; the code block is inside of a begin/rescue that updates the job to not running in case the code being executed fails. I looked up what a race condition is and it's when the job would run twice at the same time; how do you think this could happen with my current implementation? Could you give me a brief example of how you would approach this? – ricks Feb 05 '19 at 20:04
  • @RickS Oh I see, you are using Sidekiq on Heroku. Haven't used it there before, but I found out the reason why it's closing mid-way (probably your job is taking > 30 seconds to run?). From [this doc](https://github.com/mperham/sidekiq/wiki/Deployment), it says `"Keep in mind that Heroku puts a hard limit of 30 seconds on a process restart, the -t 25 tells Sidekiq to give the jobs 25 seconds to finish before starting the "force shutdown" procedure"` – Jay-Ar Polidario Feb 05 '19 at 20:16
  • @Jay-ArPolidario It is possible that it may have been running longer than the time Heroku gives for the job to finish. Usually this isn't an issue since most jobs just get re-queued after the server restarts, but since my code runs an update statement at the end to allow the job to run again, re-queuing will let the job run, but once it checks whether it is currently running it will not execute any code. I am not sure if there's a way to execute code in case it gets shut down without the job executing, or if I will need to change how this job works – ricks Feb 05 '19 at 20:21
  • @RickS Looking at [this](https://devcenter.heroku.com/articles/what-happens-to-ruby-apps-when-they-are-restarted), it seems like Heroku sends a "SIGTERM" to the processes (presumably this also applies to the sidekiq process). Because it's a "SIGTERM" and not a "SIGKILL" (unrescuable force shutdown), I guess you could still rescue it around your `perform` method (p.s. untested), but you can try: `def perform; # code here...; rescue SignalException => e; ensure; process.update(...); end` – Jay-Ar Polidario Feb 05 '19 at 20:43
  • @RickS Although, reading further on that Heroku page: `After Heroku sends SIGTERM to your application, it will wait a few seconds and then send SIGKILL to force it to shut down, even if it has not finished cleaning up. In this example, the ensure block does not get called at all, the program simply exits:`. Therefore if your job "hangs"/takes a long time shutting down, my "rescue; ensure" solution is still not purely reliable, but hopefully it's not gonna take long because you are just doing an `update` anyway in the `ensure` block; still not 100% reliable, i.e. a temporary DB timeout on the update – Jay-Ar Polidario Feb 05 '19 at 20:49
  • Oops, I forgot to reraise the exception and I could no longer edit my comment above. So here: `def perform; # code here...; rescue SignalException => e; raise; ensure; process.update(...); end` – Jay-Ar Polidario Feb 05 '19 at 21:34
  • Nevermind my code in the comments above; I found out that the `rescue ...` is no longer needed, because `ensure` alone already works after testing it. See my answer below. – Jay-Ar Polidario Feb 05 '19 at 21:47
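For reference, a minimal sketch of the transaction idea from the first comment (untested; the guard stays outside the block, and as noted in the second comment this approach is still open to race conditions):

def perform
  process = ProcessTime.where("name = 'ad_queue_process'").first

  # Return if job is running
  return if process.is_running == true

  ProcessTime.transaction do
    # ... the actual work that touches the DB goes here ...

    # This update only commits if nothing above raised; if Sidekiq is killed
    # mid-block, the whole transaction is rolled back instead
    process.update_attributes(is_running: false, last_execution_time: Time.now)
  end
end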

2 Answers


Update:

  • Thanks to @Aaron, and following our discussion (comments below): the `ensure` block (which is executed by the forked worker threads) can only run for a few unguaranteed milliseconds before the main thread forcefully terminates those worker threads, so that the main thread can do some cleanup further up the exception stack and avoid getting SIGKILL-ed by Heroku. Therefore, make sure that your `ensure` code is really fast!

TL;DR:

def perform(*args)
  # your code here
ensure
  # `process` here is the ProcessTime record looked up at the start of the job
  process.update_attributes(is_running: false, last_execution_time: Time.now)
end
  • The `ensure` above is always called regardless of whether the method "succeeded" or an Exception was raised. I tested this: see this repl code, and click "Run"

  • In other words, this is always called even on a SignalException, even if the signal is SIGTERM (graceful-shutdown signal), except on SIGKILL (forced, unrescuable shutdown). You can verify this behaviour by checking my repl code, changing Process.kill('TERM', Process.pid) to Process.kill('KILL', Process.pid), and then clicking "Run" again (you'll notice that the puts won't be called). A standalone sketch of this test is also included after this list.

  • Looking at Heroku docs, I quote:

    When Heroku is going to shut down a dyno (for a restart or a new deploy, etc.), it first sends a SIGTERM signal to the processes in the dyno.

    After Heroku sends SIGTERM to your application, it will wait a few seconds and then send SIGKILL to force it to shut down, even if it has not finished cleaning up. In this example, the ensure block does not get called at all, the program simply exits

    ... which means that the ensure block will be called because it's a SIGTERM and not a SIGKILL, except if the shutting down takes a looong time, which may be due to (some reasons I could think of at the moment):

    • Something inside your perform code (or any ruby code in the stack, even gems) that also rescued the SignalException (or even rescued the root Exception class, because SignalException is a subclass of Exception) but takes a long time cleaning up (i.e. cleaning up connections to the DB, or I/O stuff that hangs your application)

    • Or, your own ensure block above takes a looong time. I.e. when doing the process.update_attributes(...), if for some reason the DB temporarily hangs (network delay or timeout), then that update might not succeed at all and will run out of time, at which point, per my quote above, a few seconds after the SIGTERM the application will be forced to stop by Heroku sending a SIGKILL.

... which all means that my solution is still not fully reliable, but it should work under normal circumstances
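For reference, here is a small standalone script (an approximation of the repl test mentioned above, not the original repl code) showing that the `ensure` block still runs on SIGTERM but not on SIGKILL:

def perform
  sleep 10
ensure
  puts "ensure block ran"
end

# Send ourselves a TERM signal one second into the job;
# change 'TERM' to 'KILL' and neither message gets printed.
Thread.new do
  sleep 1
  Process.kill('TERM', Process.pid)
end

begin
  perform
rescue SignalException => e
  puts "rescued #{e.message}"
end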

Jay-Ar Polidario
  • I'm trying some solutions for this very problem, and this particular solution seems promising. But what I'm finding is that once Sidekiq raises `Sidekiq::Shutdown` on the worker thread, it does not wait for the thread to finish before `exit`ing the process. This means with this solution, there is a race between your `ensure` block finishing and the process exiting. I so want this solution to work, but without some kind of tweak, your `ensure` block will have to be _very_ fast to have any hope of finishing. I'm working through that now... – Aaron Feb 27 '19 at 17:20
  • @Aaron Yes, you are right, and as I said, the `ensure` block should run very fast (and any ruby code that rescued the Interrupt (TERM signal) exception should also run very quickly). Looking at these [docs](https://github.com/mperham/sidekiq/wiki/Signals#term), it seems like Sidekiq::Shutdown is raised after 8 seconds (because Heroku sends the SIGKILL at the 10th second). So the `ensure` block above should finish in ~2 seconds, but let's say it SHOULD finish in 1 second, so that any ruby code higher in the stack can only have a chance to rescue the exception – Jay-Ar Polidario Feb 27 '19 at 19:35
  • However, the said 8-seconds limit is `Sidekiq` stuff, because what happens is (from what I understood, correct me if I'm wrong): `begin; Process.kill('TERM', Process.pid); rescue Exception; raise_sidekiq_shutdown_after_8_seconds do # then Sidekiq waits for the job to finish; end; rescue Sidekiq::Shutdown => # oh no! the job didn't finish after 8 seconds!, so me (Sidekiq) only has 2 seconds to do internal cleanup (which includes pushing back the job to the retry queue); ensure; # then finally your ensure code gets run here! and thus it should be fast! #end` – Jay-Ar Polidario Feb 27 '19 at 19:41
  • Sorry, typo in the previous previous message: "so that any ruby code higher in the stack can [also] have a chance to rescue the exception". Also, P.S. what I said in my comment above is not tested! I haven't seen the Sidekiq process ruby code, nor have I done any debugging yet. All of the above was just from my past observations and assumptions. So I'd be curious to know the facts as well, and would be interested to know more about your findings! :) – Jay-Ar Polidario Feb 27 '19 at 20:20
  • In my experience working through this, I'm finding that we have something more like a few milliseconds for the `ensure` block to complete before the process exits. At https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/manager.rb#L74 we see that Sidekiq waits for the threads to finish, then raises `Sidekiq::Shutdown` in each thread via `Processor#kill`, then immediately (without waiting) goes back up the call stack where it calls `exit(0)`. So it doesn't wait 8 seconds after sending the shutdown signal - it just exits. – Aaron Mar 21 '19 at 15:52
  • Hmmm, that's interesting. `Sidekiq::Shutdown` seems to be a subclass of `Interrupt`, so the `ensure` should still have been called in the forked threads (in which the jobs are running and the `ensure` block is placed), so long as the main thread does not terminate immediately. [Sidekiq docs](https://github.com/mperham/sidekiq/wiki/Signals#term) seem to suggest, though, that there are still roughly around 5 seconds after the 25-second waiting time for the cleanup to happen via `Sidekiq::Shutdown`. I'm just curious whether your specific configuration has an effect on this behaviour. – Jay-Ar Polidario Mar 21 '19 at 16:22
  • @Aaron ...i.e. maybe your Sidekiq timeout is configured to be 29 seconds (`-t 29`), and therefore, `Sidekiq::Shutdown` only gets raised at the 29th second, and only has 1 second before `Heroku` sends the SIGKILL at the 30th second. but thanks for your detailed experience on this particular topic. Should I get the time, I'll try to recreate the scenarios as well in Heroku + Sidekiq – Jay-Ar Polidario Mar 21 '19 at 16:24
  • "so long as the main-thread does not terminate immediately" - that's the problem, the main thread is terminating (almost) immediately. At the point in sidekiq's code where the `ensure` block gets run is the same place where sidekiq goes up to the call stack and `exit`s. During that time between the timeout and when Heroku sends `SIGKILL`, Sidekiq _shuts itself down_ in order to avoid getting force-killed. – Aaron Apr 05 '19 at 12:51
  • @Aaron I just set up a heroku sample rails+sidekiq app to replicate the scenario, and yes I can confirm that you are correct that using a `Procfile`: `worker: bundle exec sidekiq -t 25 -c 2` configuration, the `ensure` block can only run for a few milliseconds, and I can see in the logs that it is the main thread terminating the busy worker thread, while the worker thread is "cleaning" or running the `ensure` block (I set my `ensure` block to loop forever to test this out). I thought it would be only heroku sending the SIGKILL (after the 30th second) that will eventually kill the process... – Jay-Ar Polidario Apr 05 '19 at 14:28
  • but... from this test result, it's also the main-thread Sidekiq process (probably it has an internal timing mechanism that tracks the forked worker threads)... that also terminates the `ensure` block (specifically speaking, the "thread"), probably for good reason. Maybe before Heroku literally sends the SIGKILL at the 30th second, the main thread needs to do "something else"/cleanup after all of the `ensure` blocks have finished, otherwise the main thread would also just immediately be stopped by the SIGKILL while still waiting for the worker threads to completely finish. – Jay-Ar Polidario Apr 05 '19 at 14:32
  • Unfortunately, I couldn't think of any other good solution, unless perhaps Sidekiq has an API to handle this (of which I am not yet aware), or unless we monkey-patch some of the Sidekiq code that deals with the termination of the forked worker threads (though probably best not to). I was also thinking of/looking for a way to hijack the main-thread execution by hopefully putting it to sleep first (right after the `ensure` block is called) so that the main thread won't be able to preemptively terminate the worker thread, but... couldn't find a working solution unfortunately. – Jay-Ar Polidario Apr 05 '19 at 14:43
  • I sure will let you know! The best thing I've been able to think of so far is to monkey-patch sidekiq to wait for the thread (at least for a few seconds) before killing the process. But that makes me nervous. Thank you for thinking this through with me! – Aaron Apr 16 '19 at 15:37

Handle sidekiq shutdown exception

class SomeWorker
  include Sidekiq::Worker

  sidekiq_options queue: :default

  def perform(params)
    ...

  rescue Sidekiq::Shutdown
    # Push the job back onto the queue so it runs again after the restart
    SomeWorker.perform_async(params)
  end
end
NeverBe
  • Hey, according to this guide https://www.rubydoc.info/github/mperham/sidekiq/Sidekiq/Shutdown it says "This is needed to rollback db transactions, otherwise Ruby's Thread#kill will commit....DO NOT RESCUE THIS ERROR IN YOUR WORKERS", so I don't think this is the best solution – ricks Feb 05 '19 at 20:13
  • but you can re-raise it after handling – NeverBe Feb 05 '19 at 20:27
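A sketch of what that re-raise suggestion might look like (untested, and keep in mind the documentation warning quoted above about rescuing this error):

class SomeWorker
  include Sidekiq::Worker

  sidekiq_options queue: :default

  def perform(params)
    # ... actual work ...
  rescue Sidekiq::Shutdown
    # Push the job back onto the queue, then re-raise so Sidekiq's own
    # shutdown handling (e.g. rolling back the DB transaction) still happens
    SomeWorker.perform_async(params)
    raise
  end
end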