I've been working on a systemd service to wrap an administration script and I'm trying to gracefully handle it completely breaking.
Right now I have Restart set to always so it will try again when something fails, but some failure states require attention (missing config file, bad SQL, etc.), so I don't want it continuously spinning in the background in an uncorrectable state.
I found StartLimitInterval, StartLimitBurst, and StartLimitAction, which together stop it from restarting after X failures in Y seconds, but it turns out that the only actions available for StartLimitAction are rebooting or shutting down the machine, which is a little overkill.
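For context, here is a minimal sketch of the kind of unit described so far. The unit name, script path, and limit values are made up for illustration and are not the real ones.

```ini
# admin-script.service (hypothetical name)
[Unit]
Description=Administration script wrapper

[Service]
# Hypothetical path to the wrapped script
ExecStart=/usr/local/bin/admin-script.sh
Restart=always
# Illustrative values: give up after 5 starts within 60 seconds
StartLimitInterval=60
StartLimitBurst=5
# Left at the default; the other values reboot or power off the machine
StartLimitAction=none
```

The start-limit settings are shown in [Service], which matches older systemd releases. On newer releases they belong in [Unit], and the interval is named StartLimitIntervalSec.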
I've been looking at OnFailure and wrote a mini service to send an alert email when it's triggered, but OnFailure triggers every time the service dies, not when it hits the start limit, so we get a bunch of emails instead of just one.
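The OnFailure wiring looks roughly like the sketch below. Both unit names and the mail script path are hypothetical placeholders.

```ini
# Added to admin-script.service (hypothetical name)
[Unit]
OnFailure=admin-script-alert.service

# admin-script-alert.service (hypothetical name), a separate unit file
[Unit]
Description=Send an alert email when admin-script fails

[Service]
Type=oneshot
# Hypothetical script that sends the alert email
ExecStart=/usr/local/bin/send-alert-mail.sh
```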
Any ideas of what to try next?