
For mysterious reasons, the machines that make up our Hadoop cluster seem to experience waves of SIGHUP signals. All boxes run CentOS 6.7/6.8 and Cloudera (CM+CDH) 5.9.

When such a SIGHUP wave hits a machine, I see processes getting stuck (some from Hadoop, some native to the OS, such as ntpd), and traces of SIGHUP being logged in several files. One example from /var/log/messages looks like

Jan 30 10:19:43 hadoop21 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="2451" x-info="http://www.rsyslog.com"] rsyslogd was HUPed  
Jan 30 10:19:43 hadoop21 ntpd[135740]: ntpd exiting on signal 1  
Jan 30 10:19:43 hadoop21 init: tty (/dev/tty5) main process (134662) killed by HUP signal  
...

In order to understand the issue further, I decided to try to get the PID of the process sending the SIGHUP (I am not really sure this is ultimately the info I need, but the investigation has to start somewhere).

To achieve this, I thought of starting a simple Python script, sighup_victim.py, and attaching strace to it, under the assumption that the last line collected by strace would contain the interesting info. I do it programmatically via orchestrator.py:

orchestrator.py

import subprocess

# Start the victim, then attach strace to it so that incoming signals
# (and their senders' PIDs) are recorded in tracelog.
p = subprocess.Popen(["python", "sighup_victim.py"])
q = subprocess.Popen(["strace", "-tt", "-o", "tracelog", "-p", str(p.pid)])

If I run orchestrator.py from a terminal and trigger the signal manually, as in `$ kill -SIGHUP <p.pid>`, I get this in tracelog:

22:04:42.791561 --- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=69035, si_uid=0} ---
22:04:42.791910 +++ killed by SIGHUP +++

I interpret this as a success: strace can indeed report that a SIGHUP was delivered to the victim, together with the identity (si_pid) of its sender.
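The si_pid field is exactly what I am after. Once captured, the sender can be resolved to a command line (while it is still alive) by reading /proc; a small helper of my own, not part of the scripts above:

```python
import os

def identify(pid):
    """Best-effort lookup of the command line behind a PID via /proc."""
    try:
        with open("/proc/%d/cmdline" % pid) as f:
            # cmdline is NUL-separated; make it human-readable
            return f.read().replace("\0", " ").strip()
    except IOError:  # the sender may already be gone
        return None

# Our own PID always resolves; a stale/unknown PID returns None.
print(identify(os.getpid()) is not None)
```

Of course this only works if the sender is still running by the time the trace is inspected, which is exactly what a short-lived `kill` invocation will not be.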

Then I deploy orchestrator.py, together with a script run_orchestrator.sh that does `python orchestrator.py > /dev/null`, to all machines, and trigger run_orchestrator.sh via ssh.

So far, on 4 out of 4 occasions in which I saw the SIGHUP wave coming, sighup_victim.py died (as expected), but the last entry in tracelog is

22:11:46.145040 select(0, NULL, NULL, NULL, {60, 0} <detached ...>

as if the strace process were always killed before sighup_victim.py. To me this coincidence simply says that I am not fully understanding the problem.
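My working hypothesis (an assumption on my part, not something the logs confirm): strace inherits the orchestrator's session, so whatever hangs up that session takes strace down along with the victim, and strace detaches before it can log the victim's death. Starting strace in its own session via os.setsid should shield it from tty-driven hangups; the mechanism in isolation:

```python
import os
import subprocess

# preexec_fn=os.setsid makes the child the leader of a brand-new session
# with no controlling terminal, so a hangup delivered to the parent's
# session can no longer reach it.
child = subprocess.Popen(["sleep", "30"], preexec_fn=os.setsid)

print(os.getsid(child.pid) == os.getsid(0))  # False: child has its own session
child.terminate()
child.wait()  # reap the child
```

If this pans out, the change in orchestrator.py would be to pass `preexec_fn=os.setsid` to the strace `Popen` call (and possibly to the victim's as well).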

I am looking at alternative ways of implementing this idea (specifically using audit), but could anyone please help me get a better picture of what is going on, so I can learn from the mistake I am certainly making?
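For the audit route, something along these lines should work (a sketch, to be run as root; the key name sighup_hunt is my own, and 32-bit processes would additionally need matching arch=b32 rules; signals sent via tkill/tgkill or by the kernel itself need separate handling):

```shell
# Record every kill() whose signal argument (a1) is 1, i.e. SIGHUP;
# the resulting audit records carry the sender's pid, uid, comm and exe.
auditctl -a always,exit -F arch=b64 -S kill -F a1=1 -k sighup_hunt

# After the next wave, search the audit log by that key:
ausearch -k sighup_hunt -i
```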

Thanks!
An (even longer) description of the problem is available on the Cloudera community forum.

  • `SIGHUP` is sent whenever the user hangs up (hence "HUP") the phone to which his modem is connected. In modern terms, any program that is running in the background when SSH exits will receive SIGHUP. – Robᵩ Jan 30 '17 at 22:38
  • See also: https://linux.die.net/man/1/nohup – Robᵩ Jan 30 '17 at 22:39
  • Thanks for the appreciated feedback! i) as far as I can understand, that's just one use of SIGHUP [link](http://stackoverflow.com/questions/19052354/sighup-for-reloading-configuration). ii) what happens is, IMHO, that `ssh` keeps the session open for the time it takes to run `orchestrator.py`, then both complete. However, `Popen` ensures `sighup_victim.py` survives [link](https://docs.python.org/2/library/subprocess.html), and this I can actually see. iii) I already gave `nohup` for `strace` a try, but I was calling `Popen` with `shell=False`; I'll try with `shell=True` instead – fdeltedesco Jan 31 '17 at 14:17

1 Answer


Is there anything in /var/log/kern.log or dmesg? Does it happen on a single machine?

As Rob says, a SIGHUP is far different from, say, an OOM kernel kill, which would be a SIGKILL.

  • Thanks for the appreciated feedback. As far as I can tell, our CentOS 6.7 config forwards everything to `/var/log/messages` [link](https://www.digitalocean.com/community/tutorials/how-to-view-and-configure-linux-logs-on-ubuntu-and-centos); an excerpt of it is already contained in the original question. `dmesg` does not seem to be more informative. The issue occurs across several machines, as said initially. – fdeltedesco Jan 31 '17 at 14:31