4

I run the perl script in screen (I can log in and check debug output). Nothing in the logic of the script should be capable of killing it quite this dead.

I'm one of only two people with access to the server, and the other guy swears that it isn't him (and we both have quite a bit of money riding on it continuing to run without a hitch). I have no reason to believe that some hacker has managed to get a shell or anything like that. I have very little reason to suspect the admins of the host operation (bandwidth/cpu-wise, this script is pretty lightweight).

Screen continues to run, but at the end of the output of the perl script I see "Killed" and it has dropped back to a prompt. How do I go about testing what is whacking the damn thing?

I've checked crontab; there's nothing in there that would kill random (or non-random) processes. Nothing in any of the log files gives any hint. It will run for anywhere from 2 to 8 hours, it would seem (and on my Mac at home, it will run well over 24 hours without a problem). The server is running Ubuntu version something or other; I can look that up if it matters.

John O
  • Running out of memory? Check the logs for oom killer messages. – mattdm Feb 08 '11 at 18:13
  • Is it a script that runs in the current PTY, or is it written to daemonize itself? Even programs that daemonize themselves can die when you log off if they were started from a PTY session. Try putting `nohup` before your command to run your script. Alternatively, you could add a script in `/etc/init.d` to launch your script, which you could then start and stop with `sudo service _name-of-Perl-script_ [start|stop|restart]`. – nesv Feb 08 '11 at 18:16
  • Not written to daemonize itself. Just outputs some monitoring data via the Curses mod. That's why I'm using screen at the moment, to avoid that. – John O Feb 08 '11 at 18:19
  • Just checked the logs again. Nothing in messages, debug, dmesg, kern.log, or syslog. Am I missing any, should I be grepping for something specific? I mean, I don't think I'm using that much memory in the script... but there could be some leak I'm not recognizing. – John O Feb 08 '11 at 18:25
  • Do the hosts have root on the server? Adding the signal traps below would be the best first step, and you should add timestamp logging if you don't have it already to make sure you know exactly how long the script is living and when it is being killed. If you get that far with no answer then more logging, and/or external monitoring will be the next step. – daveadams Feb 08 '11 at 18:47
  • Just look for the time it was killed. Add a timestamp (`\t`) to your PS1 if you don't have it already. /var/log/syslog and auth.log should be enough. – Tobu Feb 08 '11 at 18:50

7 Answers

5

Put in signal handlers for all the signals (TERM, SEGV, INT, HUP, etc.) and have them log a message whenever they are hit. That won't tell you what is sending the signal, but it will let you see which signal it is and, if you like, ignore it.

$SIG{'TERM'} = $SIG{'INT'} = sub { print(STDERR "Caught SIG$_[0]. Ignoring\n"); };

That will print a line whenever the script catches a SIGTERM or SIGINT and then return control to the program. Of course, with all those signals being ignored, the only ways to stop it are for the program itself to exit or for something to send it a SIGKILL, which can't be caught.
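
For reference, a slightly fuller sketch along the same lines, trapping the usual termination signals and timestamping each one (the signal list and log format here are just suggestions):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw(strftime);

    # Log and keep running on the usual termination signals. KILL and STOP can
    # never be trapped, so a plain SIGKILL (e.g. from the OOM killer) will still
    # end the script with nothing logged.
    for my $sig (qw(TERM INT HUP QUIT ABRT SEGV)) {
        $SIG{$sig} = sub {
            my ($name) = @_;
            my $stamp  = strftime('%Y-%m-%d %H:%M:%S', localtime);
            print STDERR "$stamp caught SIG$name, ignoring\n";
        };
    }

    # ... the long-running monitoring loop goes here ...

If the script keeps dying with nothing logged by these handlers, the likely culprit is a SIGKILL, which cannot be caught; both the OOM killer and a hard CPU-time limit deliver exactly that.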

phemmer
  • I added the signal handling last night, per your suggestions. It's certainly harder to stop it for debugging now... and we can rule out anything quite so simple now. – John O Feb 09 '11 at 16:06
3

I realize this isn't exactly an answer to the question you asked, so I apologize if it's somewhat off-topic, but: does your app really need to run continuously, forever? Perl is not the most resource-thrifty environment in the world, and while interpreter start-up overhead is a real cost, extremely long-running scripts have troubles of their own. Memory leaks, often at a level below your control, are the bane of the vanilla-Perl developer's existence, which is why folks often mitigate them either by running inside a more deliberately resource-conscious framework like POE, or by handing the long-running listener part of the job to a front-end service like xinetd and only executing the Perl component when work needs to be done.

I run several Perl scripts that continuously read and process the output of our (considerably large) central syslog stream. They constantly suffer from terrible, inexplicable "didn't free up memory despite pruning hash keys" problems, and they are on the block to be front-ended by something better suited to continuous high-volume input (an event queue like Gearman, for example), so we can leave Perl to the data-munging tasks it does best.

That went on a bit; I do apologize. I hope it's at least somewhat helpful!

Jeff Albert
  • Work needs to be done once a second, by requesting time-sensitive json off a third party webserver. That data is chucked into a database for analysis, and we think it might be worth as much as $5000 a month. We want to keep the opportunity to ourselves, so I can't say much more than that... but if we can get a month or two's worth of data, more or less continuous, we should have what we need. We'll deal with the longer term memory leak problems by periodically restarting it should we need that. – John O Feb 08 '11 at 22:17
2

Without much in the way of actual knowledge, I'd start by looking in the dmesg output and the assorted syslogs to see if the OOM killer has been running. If it has, that's probably it.
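
If you want something concrete to grep for, these should turn up any OOM-killer activity (the log paths are the usual Ubuntu defaults; adjust if yours differ):

    dmesg | grep -i 'killed process'
    grep -i -e 'out of memory' -e 'oom' /var/log/syslog /var/log/kern.log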

Vatine
2

Syslog is the first thing to consult. If it isn't sufficient…

You can't normally determine who sent a signal to a process. It could be another process, it could be the kernel, etc. Short of involving the very recent perf framework, some guesswork is involved.

However, you can set up some better monitoring. The atop package, in Debian/Ubuntu, sets up a service that logs system load and per-process activity (disk, memory, CPU). You can then consult those logs and get a feel for what was happening at the time the process was killed.

Crash course: `sudo atop -r`, navigate with `t` and `T`, type `h` to get help about the various visualisations.

Also consider adding a signal handler that dumps the output of pstree to a temporary file.
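
A rough sketch of that last idea, assuming pstree is installed (the log path is arbitrary):

    # On SIGTERM, append a timestamp and the current process tree to a scratch
    # file before returning; that file shows what else was running at the
    # moment the signal arrived.
    $SIG{TERM} = sub {
        if (open my $fh, '>>', '/tmp/perl-monitor-signals.log') {
            print $fh scalar(localtime), " - SIGTERM received; process tree:\n";
            print $fh scalar qx(pstree -p);
            close $fh;
        }
    };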

Tobu
2

Likely you are running into resource limits, for example on CPU time. Try `ulimit -a` to check. If it's only a soft limit set in a login script, then you can fix it with, e.g., `ulimit -t unlimited`. If it's a hard limit, as is set for example for regular users on OpenBSD and other OSs, then you'll need root to override it.
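
To see both the soft and hard CPU-time limits in the shell (and screen session) that launches the script:

    ulimit -St            # soft CPU-time limit, in seconds ("unlimited" if none)
    ulimit -Ht            # hard ceiling; only root can raise this
    ulimit -t unlimited   # raise the limit before starting the script, if permitted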

Lev Bishop
1

Until you nail the issue, running the script with

nohup scriptname

can help. If it still crashes, examine the nohup.out file.

And if nothing mentioned here helps, I'd try strace/ltrace to see what system or library calls the script was making before it died, though be warned that they generate a LOT of output.
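
If it comes to that, attaching to the already-running process and logging to a file keeps the output manageable; the last lines of the file will show which signal ended the process (e.g. "+++ killed by SIGKILL +++"). The script name here is just a placeholder for pgrep to match:

    # -tt: timestamp each line, -f: follow forks, -o: write to a file instead of the terminal
    strace -tt -f -o /tmp/monitor.strace -p "$(pgrep -f monitor.pl)"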

Juraj
1

In a previous life I found a DEC Ultrix box that had a very clever cron job which looked for any process with more than 1 CPU hour and killed it, which was why the nightly batch report job died every night.

Are there any clever cron jobs/scripts that might be killing it? Or it might be some other performance-tuning parameter or limit, along the lines of the ulimit answer already given.
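
A quick audit along these lines covers both the system-wide and the per-user crontabs (paths are the Debian/Ubuntu defaults):

    # system-wide entries
    cat /etc/crontab
    ls /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly
    # per-user crontabs (needs root); errors for users without one are discarded
    for u in $(cut -d: -f1 /etc/passwd); do echo "== $u =="; sudo crontab -l -u "$u" 2>/dev/null; done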

jqa