
As of about a week ago, my cron daemon refuses to stay running. I'm using Debian 6 x64 on an OpenVZ virtual machine. Running something like `pgrep cron` shows that the daemon isn't running. I start the service with `service cron start` or `/etc/init.d/cron start` and it launches, but it disappears from the running process list anywhere from 1 to 30 minutes later, when the process is killed again.
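To put firmer numbers on the "1 to 30 minutes" figure, a small polling helper can report how long the daemon survives after each start. This is a minimal sketch, assuming a Linux `/proc` filesystem; `is_alive` and `wait_for_death` are hypothetical helper names, not standard tools.

```shell
#!/bin/bash
# Sketch: measure roughly how long a process survives.
# is_alive and wait_for_death are hypothetical helpers (not standard tools);
# reading /proc/PID/status is Linux-specific.

# Succeed while PID exists and is not a zombie.
is_alive() {
    local state
    # The "State:" line looks like "State:  S (sleeping)"; keep the letter.
    state=$(awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# Poll PID every INTERVAL seconds (default 10) until it is gone, then
# print approximately how many seconds it was observed alive.
wait_for_death() {
    local pid=$1 interval=${2:-10}
    local start=$SECONDS
    while is_alive "$pid"; do
        sleep "$interval"
    done
    echo "$((SECONDS - start))"
}
```

For example, right after `service cron start`, running `wait_for_death "$(pgrep -o -x cron)" 10` as root prints an approximate lifetime in seconds; logging that together with `date` on each restart gives a record of when every kill happened. If `pgrep` finds no cron process, the helper prints 0 immediately.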

Using `strace -f service cron start`, I can see that the process is being killed for some reason:

nanosleep({60, 0},  <unfinished ...>
+++ killed by SIGKILL +++

There's nothing relevant in /var/log/syslog, /var/log/messages, /var/log/auth.log, or /var/log/kern.log to explain why the process is dying. The system has at least 800 MB of free memory, and `cat /proc/loadavg` returns `0.22 0.13 0.04`, so resources shouldn't be the issue. With cron running, `free -m` reports:

             total       used       free     shared    buffers     cached
Mem:          1024        211        812          0          0          0
-/+ buffers/cache:        211        812
Swap:            0          0          0

I also tried removing and reinstalling the cron package using apt-get.

Update: I initially thought the problem was a resource issue, so I erased my entire VPS and started from a fresh Debian image. Nothing else is running on the system now, but even on a clean install my cron daemon is still being killed at random.

What else should I check? How do I find out what's killing my crond?

quietmint
  • For completeness, what's the output of `free -m` (concerned that your 500MB free is happening after it's killed and you have no swap) and is this a physical or virtual machine? If virtual, what hypervisor? – Jay Jun 29 '12 at 00:09
  • Ah, good point. It's an OpenVZ VM. I updated the question. – quietmint Jun 29 '12 at 00:23
  • I'm having exactly the same issue after reinstalling a VPS under OpenVZ with vePortal, Debian 5 x86_64 and DirectAdmin. In my case I have no control over the host because it's with a VPS hosting provider. The counters in `/proc/user_beancounters` don't reset after the virtual host reboots. – Martín Claro Jul 13 '12 at 03:57

2 Answers


Look at `/proc/user_beancounters`, specifically at the `failcnt` column.

For every non-zero entry, you'll need to raise the barrier/limit accordingly; it's probably just OpenVZ killing your processes for hitting those limits.

Here is a description of each column: http://wiki.openvz.org/Proc/user_beancounters

For accountable parameters, the field held shows the current counter for the container (resource “usage”), and the field maxheld shows the counter's maximum for the last accounting period. The accounting period is usually the lifetime of the container.

The field failcnt shows the number of refused “resource allocations” for the whole lifetime of the process group.

The barrier and limit fields are resource control settings. For some parameters, only one of them may be used, for some parameters — both. These fields may specify limits or guarantees, and the exact meaning of them is parameter-specific. Description of each parameter in UBC parameters contains information about the difference between the barrier and the limit for the parameter.
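Scanning for non-zero `failcnt` entries by eye is error-prone, partly because the first data row carries a `uid:` prefix that shifts its columns. A minimal sketch that counts fields from the end of each row instead, assuming the layout described on the wiki page (`resource held maxheld barrier limit failcnt`, after a version line and a header line); `show_failcnt` is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: print every beancounter resource whose failcnt is non-zero.
# show_failcnt is a hypothetical helper; the file is normally readable
# only by root inside the container.
show_failcnt() {
    local file=${1:-/proc/user_beancounters}
    # Line 1 is "Version: ...", line 2 is the column header. The first data
    # row also starts with a "uid:" field, so count from the end: failcnt
    # is the last field and the resource name is five fields before it.
    awk 'NR > 2 && $NF > 0 { print $(NF - 5), $NF }' "$file"
}
```

Run as root inside the container, `show_failcnt` prints one `resource count` line per resource that has ever refused an allocation; those are the barrier/limit pairs worth raising.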

Jay
  • Thanks! This is very useful. Most things look in line to me, except I see `privvmpages` has failed about 4 million times since the start of the accounting period. :-) I stopped and restarted my VM and the failcounts didn't reset -- so I don't know how long my accounting period is (since the VM was created?). However, I'm not sure if this is the cause of my cron mystery. Shouldn't I see one of the failcounts increase after each death of crond? – quietmint Jun 29 '12 at 01:02
  • If it answers your question, please click the green hollow tick on the left of my answer to accept it :-) – Jay Jun 29 '12 at 01:04
  • Sorry, I forgot I was in a comment and was premature in my use of the Enter key. :-) – quietmint Jun 29 '12 at 01:06
  • Heh, fair enough :-) In my experience `failcnt`s definitely reset after a VM restart, so I'm not sure why yours didn't. `failcnt` measures the number of times allocations were *refused* (e.g. if you write a C program that calls `malloc()` and that call fails, only then will `failcnt` increment). OpenVZ may have killed your application in order to keep a different one running (e.g. it may judge that killing `crond` would allow all other applications to function correctly, so killing it is the best option). – Jay Jun 29 '12 at 01:14
  • Ah, okay, so it may have sent the kill signal to `crond` in order to _avoid_ "failure" (because killing it would allow another process's memory allocation request to proceed)? Good thing OpenVZ is there to save the day ;-) – quietmint Jun 29 '12 at 01:22
  • Precisely, the kernel can cheat by allocating memory to anyone who asks, then just killing the process it feels would be best so the rest can use their allocated share. Indeed! :-) – Jay Jun 29 '12 at 01:28
  • While this initially was a possibility, I don't think this is the case. I erased the VPS, installed a fresh copy of Debian and ran `apt-get` to `update` and `upgrade` the packages, and even with a clean install like this the problem persists. I initially suspected something in the `cron` APT package changed recently, but [Debian's change log](http://packages.debian.org/changelogs/pool/main/c/cron/cron_3.0pl1-116/changelog) shows the last modification was in Nov 2010. I'm back to square one. – quietmint Jul 08 '12 at 14:06
  • `SIGKILL` is the strongest kill signal that can be issued. Are you sure you don't have a script on a host that is accidentally terminating `crond` processes (processes within VEs will appear as processes on the host, under a normal `ps`). If not, how about the logs on the host? Anything there? `dmesg`? – Jay Jul 08 '12 at 14:42
  • This is a hosted VPS where I have no control over the host. – quietmint Jul 13 '12 at 22:48
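One way to check whether any `failcnt` rises each time crond dies, as asked in the comments, is to snapshot the counters before starting cron and again after it disappears, then print only the counters that grew. A minimal sketch; `snapshot_failcnt` and `failcnt_delta` are hypothetical helper names, and the parsing assumes the same column layout as `/proc/user_beancounters`:

```shell
#!/bin/bash
# Sketch: compare beancounter failcnt values between two points in time.
# snapshot_failcnt and failcnt_delta are hypothetical helpers.

# Print "resource failcnt" for every data row (skipping the version and
# header lines); fields are counted from the end to ignore the uid prefix.
snapshot_failcnt() {
    awk 'NR > 2 { print $(NF - 5), $NF }' "${1:-/proc/user_beancounters}"
}

# Given two snapshot files, print "resource before -> after" for every
# counter that increased between them.
failcnt_delta() {
    awk 'NR == FNR { before[$1] = $2 + 0; next }
         ($1 in before) && $2 + 0 > before[$1] { print $1, before[$1], "->", $2 }' "$1" "$2"
}
```

Usage: run `snapshot_failcnt > /tmp/before`, start cron and wait until it vanishes, then run `snapshot_failcnt > /tmp/after` and `failcnt_delta /tmp/before /tmp/after`. An empty result would be consistent with the point above that `failcnt` only counts refused allocations, so a kill need not show up there at all.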

After much trial and error, I have stumbled on a workaround. For some reason, cron is only killed (presumably by the host) when it runs in daemon mode. If launched with `cron -f`, the process persists. So I created a simple script that launches it in the foreground (and relaunches it in the unlikely event that it does get killed):

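#!/bin/bash -u
# Keep cron running in the foreground, relaunching it whenever it exits.
while true; do
        logger -i -t cronrestart -p cron.warn "Launching cron daemon"
        cron -f    # -f: stay in the foreground instead of daemonizing
        logger -i -t cronrestart -p cron.warn "Cron daemon killed"
        sleep 5    # avoid a tight restart loop if cron exits immediately
done
# Not reached: the loop above never terminates on its own.
logger -i -t cronrestart -p cron.warn "Quitting"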

Then I start this launcher script via `nohup cronrestart >/dev/null &` so it runs in the background. Cron has been running this way for a week so far without being killed.

I suppose the next step would be to try making the launcher script start itself in the background to better simulate the daemon mode of cron.
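The self-backgrounding step could look roughly like the following minimal sketch. Instead of `nohup`, it uses a background subshell that ignores `SIGHUP` itself (which is what `nohup` does for the command it runs), so the launcher can return immediately the way a forking daemon does. `log_warn`, `relaunch_loop`, `start_in_background`, `RESTART_DELAY` and `MAX_RUNS` are all hypothetical names; `MAX_RUNS` exists only so the loop can be exercised without running forever.

```shell
#!/bin/bash
# Sketch of a launcher that puts itself in the background.
# All function and variable names here are hypothetical.

# Log to syslog at cron.warn; never fail if logger is unavailable.
log_warn() {
    logger -i -t cronrestart -p cron.warn "$*" 2>/dev/null || true
}

# Run "$@" again every time it exits, pausing RESTART_DELAY seconds
# (default 5) so a command that fails instantly cannot spin the CPU.
# MAX_RUNS=0 (the default) means "loop forever".
relaunch_loop() {
    local runs=0
    while [ "${MAX_RUNS:-0}" -eq 0 ] || [ "$runs" -lt "${MAX_RUNS:-0}" ]; do
        log_warn "Launching: $*"
        "$@"
        log_warn "Exited with status $?: $*"
        runs=$((runs + 1))
        sleep "${RESTART_DELAY:-5}"
    done
}

# Start relaunch_loop in a background subshell that ignores SIGHUP and
# has its stdio detached, print the subshell's PID, and return at once.
start_in_background() {
    (
        trap '' HUP    # same effect nohup has on the command it runs
        relaunch_loop "$@"
    ) </dev/null >/dev/null 2>&1 &
    echo "$!"
}
```

A launcher built this way would end with `start_in_background cron -f` and then exit, leaving the relaunch loop running on its own.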

quietmint