For mysterious reasons, the machines in our Hadoop cluster seem to experience waves of SIGHUP. All boxes run CentOS 6.7/6.8 and Cloudera (CM+CDH) 5.9.
When such a SIGHUP wave hits a machine, I see processes getting stuck (some from Hadoop, some native to the OS like ntpd), and traces of SIGHUP being logged in several files. One example from /var/log/messages looks like:
Jan 30 10:19:43 hadoop21 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="2451" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 30 10:19:43 hadoop21 ntpd[135740]: ntpd exiting on signal 1
Jan 30 10:19:43 hadoop21 init: tty (/dev/tty5) main process (134662) killed by HUP signal
...
In order to understand the issue further, I decided to try to get the PID of the process sending the SIGHUP (I am not sure this is ultimately the information I need, but the investigation has to start somewhere).
To achieve this, I thought of starting a simple Python script, sighup_victim.py, and attaching strace to it, under the assumption that the last line collected by strace would contain the interesting information. I do this programmatically via orchestrator.py:

orchestrator.py

import subprocess

# Start the victim, then attach strace to it so every signal it receives
# (and the sender's PID) gets logged to the file "tracelog".
p = subprocess.Popen(["python", "sighup_victim.py"])
q = subprocess.Popen(["strace", "-tt", "-o", "tracelog", "-p", str(p.pid)])
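For completeness, the victim itself can be minimal. The actual sighup_victim.py is not shown here, so this is only an assumed sketch: all the victim has to do is stay alive so strace has something to trace (in Python, time.sleep() is what later appears in the tracelog as a select() call with a {60, 0} timeout):

```python
# sighup_victim.py -- assumed sketch; the real script is not shown above.
import time

def wait_forever(interval=60, cycles=None):
    """Sleep in a loop; pass a finite `cycles` when testing."""
    n = 0
    while cycles is None or n < cycles:
        time.sleep(interval)
        n += 1

# The real script would simply call: wait_forever()
```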
If I run orchestrator.py from a terminal and trigger the signal manually, as in

$ kill -SIGHUP <p.pid>

I get this in tracelog:
22:04:42.791561 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=69035, si_uid=0} ---
22:04:42.791910 +++ killed by SIGHUP +++
I interpret this as a success: strace can indeed report that a SIGHUP was sent to the victim, and identify its sender.
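As a complementary check that does not rely on strace staying attached, the victim can report the sender itself. On Python 3.3+ (Linux), signal.sigwaitinfo() returns a siginfo structure carrying the same si_pid/si_uid fields that strace prints. A minimal self-contained sketch, using a forked child as a stand-in for the mystery sender:

```python
import os
import signal

# Block SIGHUP so it queues instead of triggering the default action.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGHUP})

child = os.fork()
if child == 0:
    # Child plays the role of the unknown SIGHUP sender.
    os.kill(os.getppid(), signal.SIGHUP)
    os._exit(0)

# sigwaitinfo() blocks until a SIGHUP is pending, then returns its siginfo,
# including the sender's PID and UID.
info = signal.sigwaitinfo({signal.SIGHUP})
print("SIGHUP from pid=%d uid=%d" % (info.si_pid, info.si_uid))
os.waitpid(child, 0)
```

Note this needs Python 3, whereas the scripts above appear to run under the system Python 2 on CentOS 6.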
Then I deploy orchestrator.py, together with a script run_orchestrator.sh that does python orchestrator.py > /dev/null, to all machines, and trigger run_orchestrator.sh via ssh.
So far, on 4 out of 4 occasions in which I have seen a SIGHUP wave arrive, sighup_victim.py dies (as expected), but the last entry in tracelog is
22:11:46.145040 select(0, NULL, NULL, NULL, {60, 0} <detached ...>
as if the strace process were always killed before sighup_victim.py. To me this coincidence simply says that I do not fully understand the problem.
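One mitigation I can think of, under the assumption that the wave is delivered session- or process-group-wide (as a terminal hangup would be): start the tracer in its own session, so a SIGHUP aimed at the original session cannot detach it before it records the signal. A sketch, with sleep standing in for both sighup_victim.py and the strace command line:

```python
import os
import subprocess

# Stand-in for sighup_victim.py (any long-lived process works here).
victim = subprocess.Popen(["sleep", "300"])

# Start the tracer in a new session via setsid (on Python 3.2+ the
# start_new_session=True argument does the same). In the real orchestrator
# this command would be the strace invocation shown above.
tracer = subprocess.Popen(["sleep", "300"], preexec_fn=os.setsid)

# The tracer is now a session leader, detached from the controlling tty.
assert os.getsid(tracer.pid) == tracer.pid

victim.terminate(); tracer.terminate()
victim.wait(); tracer.wait()
```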
I am looking at alternative ways of implementing this idea (specifically using audit), but could anyone please help me get a better picture of what is going on, so I can learn from the mistake I am certainly making?
Thanks!
An (even longer) description of the problem is available at Cloudera community forum.