Working around the stale pidfile problem after hard restart kills my daemon

Question

I'm using Red Hat Linux (RHEL5) on a (VMWare) VM. I've written a daemon which should stay running all the time and automatically run on boot.

Last night the VM host had an unrecoverable hardware problem and the VM abruptly halted. When it came back, my daemon didn't start because the pidfile still existed.

Apparently this is called The Stale pidfile Syndrome, but I'm not sure what the best long-term approach for mitigating it is. I'm thinking that the startup script in /etc/rc.d* should delete the pidfile before starting the daemon, but the service management script in /etc/init.d should remain the same so that things like service mydaemon start don't clobber the pidfile.

/etc/rc.d/rc6.d just has a symlink to the script in /etc/init.d/, so how can I make it behave differently only at boot? I could add another script with higher precedence in the rc.d dirs, but that seems hacky. Someone also suggested adding logic like "if uptime is less than 1 minute, delete the pidfile", but that seems hacky too.

Any thoughts or solutions or best practices?

score 3 · Answer 1 · answered Feb 23 '11 at 19:25

Use daemontools and see Process Management.

Dennis Williamson

Very good link, the `kill -0` technique is very clever. – coredump Feb 23 '11 at 19:34
But as the article I referenced, ["The Stale pidfile Syndrome"](http://perfec.to/stalepid.html), noted, how do I know, say after a hard reboot like today's, that the process at that pid is my daemon and not something else that responds to `kill -0`? Also, daemontools' `supervise` addresses a different problem. My daemon is very stable and runs for months, but it didn't start this morning after the hard reboot because the pidfile from the last run still existed. – Nathan Feb 23 '11 at 19:58

@Nathan: `pid=$(cat pidfile); process=$(ps -p "$pid" -o cmd=); if [ "$process" != "me" ]; then echo "that's definitely not me"; else echo "maybe that's me"; fi` and you can perform some additional sanity checks before deleting the pidfile and continuing. Run `grep -l stale /etc/init.d/* | xargs less -p stale` (press Alt-n to jump to successive occurrences of the search pattern) to see how some of the other daemons handle it. Also, look at `/lib/lsb/init-functions` for some useful helpers (you can repeat the search above, substituting the names of some of those functions for "stale", to see how they're used). – Dennis Williamson Feb 23 '11 at 20:23
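
The check from the comment above can be written out as a small shell function. This is only a sketch under assumptions: the function name `pidfile_is_ours` and the daemon name `mydaemon` are hypothetical, and matching on the command name reduces, but does not eliminate, the chance of mistaking an unrelated process for the daemon.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the PID stored in the pidfile belongs
# to a live process whose command name equals the expected name.
pidfile_is_ours() {
    pidfile=$1
    expected=$2                  # e.g. "mydaemon" (an assumed name)

    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")

    # Reject empty or non-numeric contents.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # ps prints nothing and fails if no process has that PID.
    # Note: on Linux the "comm" field is truncated to 15 characters.
    name=$(ps -p "$pid" -o comm= 2>/dev/null) || return 1
    [ "$name" = "$expected" ]
}

# Example (path and name are assumptions):
#   pidfile_is_ours /var/run/mydaemon.pid mydaemon || echo "stale or not ours"
```

A caller would probably treat a non-zero return as "stale or not ours" and still do a few more sanity checks before deleting the file.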

score 3 · Answer 2 · answered Feb 24 '11 at 02:26

Thank you for the hints @Dennis and @coredump.

I found out some additional information that helped me unravel the mystery.

I wondered why every other daemon recovered fine. It turns out there is code in /etc/rc.d/rc.sysinit to clean up all pidfiles in /var/run and /var/lock at boot.
I had configured my daemon to put its pidfile elsewhere because of trouble with SELinux preventing me from "using potentially mislabeled files".

I haven't fixed it yet because of the SELinux issues, but I think the answer is "put your pidfile in /var/run or /var/lock and it will work next time."
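
If the pidfile has to live outside /var/run, a similar cleanup could be done for the custom location once at boot, before the daemon's init script runs. The following is a rough sketch, not the actual rc.sysinit code; the function name `clean_pidfiles` and the example directory are assumptions.

```shell
#!/bin/sh
# Hypothetical boot-time cleanup for a pidfile directory outside /var/run.
# Deletes only regular files named *.pid; directories and other files stay.
clean_pidfiles() {
    dir=$1
    [ -d "$dir" ] || return 0    # nothing to do if the directory is absent
    find "$dir" -type f -name '*.pid' -exec rm -f {} \;
}

# Example (path is an assumption):
#   clean_pidfiles /opt/mydaemon/run
```

This should only ever run at boot. Calling it from `service mydaemon start` would clobber the pidfile of a daemon that is still running.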

Nathan

... and as it turns out, it was not necessarily SELinux that was causing my problem, but too many hardcoded pidfile locations in the daemon and its init script. – Nathan Feb 24 '11 at 21:51
I think that, per the Filesystem Hierarchy Standard, the right pathname is now just `/run`. In fact, your `/var/run` is probably a symbolic link to `/run`. https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard – Valerio Bozzolan Aug 24 '22 at 13:48

score 1 · Answer 3 · answered Feb 23 '11 at 19:06

The script is the same; at boot, the startup process just runs the 'start' action of the sysvinit scripts.

Why don't you check whether the PID in the pidfile is right and, if it isn't, delete the file and create a new one with the right PID?

EDIT: You can grep the output of ps for the PID from the pidfile to see if the process still exists, or do it the other way around. Check the Red Hat initscripts; I am sure they have some helper functions for that, like pidofproc.
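
One way to do that check inside the init script itself, sketched with `kill -0` as suggested in the comments. The function name `remove_stale_pidfile` is hypothetical, and `kill -0` only proves that some process has that PID, not that it is the daemon.

```shell
#!/bin/sh
# Hypothetical helper for an init script's start action.
# Returns 0 if there is no pidfile or a stale one was removed;
# returns 1 if a process with the recorded PID is still alive.
remove_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0          # nothing to clean up

    pid=$(cat "$pidfile" 2>/dev/null)
    case $pid in
        ''|*[!0-9]*)                       # empty or garbage: treat as stale
            rm -f "$pidfile"
            return 0 ;;
    esac

    # Signal 0 sends nothing; it only tests whether the PID can be signalled.
    # Run as root (as init scripts are), failure means no such process.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    rm -f "$pidfile"
}

# Possible use in start() (PIDFILE is an assumed variable):
#   remove_stale_pidfile "$PIDFILE" || { echo "already running"; exit 0; }
```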

edited Feb 23 '11 at 19:31

answered Feb 23 '11 at 19:06

coredump

Try kill -0 – profy Feb 23 '11 at 19:40

score 0 · Answer 4 · answered Dec 05 '21 at 17:56

Even these days, with systemd around, it's not uncommon for PID files to still be used and misplaced, particularly by software not shipped with the distribution. Usually, what I see is that some other process has been given the same PID that the service had before a (forced) reboot, and the service's start logic then thinks it is already running. Sadly, this isn't unlikely, as PIDs start at one again after a reboot and are then incremented by one for every process started. Ultimately, I decided to just create a service that sets a random PID during early boot. This should make it very unlikely that a PID is reused.

To this end, I created this service at /etc/systemd/system/randomize-pid.service:

    [Unit]
    Description=Set next upcoming PID to a random value
    DefaultDependencies=no

    # sysctl may be used to adjust highest allowed PID (pid_max)
    After=systemd-sysctl.service

    Conflicts=shutdown.target
    Before=shutdown.target

    # Run before starting "regular" services that have
    # DefaultDependencies set to yes.
    Before=sysinit.target

    ConditionPathExists=/proc/sys/kernel/pid_max
    ConditionPathExists=/proc/sys/kernel/ns_last_pid
    ConditionPathIsReadWrite=/proc/sys/kernel

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/bin/sh -c 'shuf -n 1 -i 1-$(cat /proc/sys/kernel/pid_max) > /proc/sys/kernel/ns_last_pid'
    TimeoutSec=90s

    [Install]
    WantedBy=multi-user.target

And then enabled it:

    systemctl enable randomize-pid.service

This requires checkpoint/restore support (`CONFIG_CHECKPOINT_RESTORE`) to be enabled in the kernel, but I believe this is the default on most distributions.
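
A different way to catch PID reuse across reboots, not the approach used in this answer, is to compare the pidfile's modification time with the kernel's boot time: a pidfile last written before the current boot cannot describe a running process. A minimal sketch, assuming Linux (`/proc/stat`) and GNU `stat`; the function name `pidfile_predates_boot` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the pidfile was last modified before
# the current boot, which means it must be left over from a previous run.
pidfile_predates_boot() {
    pidfile=$1

    # "btime" in /proc/stat is the boot time in seconds since the epoch.
    btime=$(awk '$1 == "btime" { print $2 }' /proc/stat 2>/dev/null)
    [ -n "$btime" ] || return 2            # cannot determine boot time

    # %Y is the modification time in seconds since the epoch (GNU stat).
    mtime=$(stat -c %Y "$pidfile" 2>/dev/null)
    [ -n "$mtime" ] || return 2            # missing file or non-GNU stat

    [ "$mtime" -lt "$btime" ]
}

# Example (path is an assumption):
#   pidfile_predates_boot /opt/mydaemon/run/mydaemon.pid && rm -f /opt/mydaemon/run/mydaemon.pid
```

This relies on the system clock being sane both when the pidfile was written and at boot, so it is a heuristic rather than a guarantee.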
