
I've been a long-time lurker on the site, but this is my first question, so please let me know if there are any issues with my post.

Two of the servers in our Ubuntu server farm (25+ machines) take a long time (10+ minutes) to restart the syslog-ng service. All machines run the same version of syslog-ng (3.5.3). Running strace on the service shows the process hanging on the following syscall (lines before and after included for context):

poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, 4294967295) = 1 ([{fd=3, revents=POLLIN|POLLHUP}]) <0.000248>

recvfrom(3, "", 8, MSG_WAITALL, NULL, NULL) = 0 <0.000005>

poll([{fd=4, events=POLLIN}], 1, 4294967295 * Starting system logging syslog-ng [ OK ]) = ? ERESTART_RESTARTBLOCK (Interrupted by signal) <841.792219>

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24591, si_status=0, si_utime=0, si_stime=0} ---

write(5, "\21", 1) = 1 <0.000008>

rt_sigreturn() = -1 EINTR (Interrupted system call) <0.000005>

poll([{fd=4, events=POLLIN}], 1, 4294967295) = 1 ([{fd=4, revents=POLLIN}]) <0.000008>

This is the result of a simple "sudo service syslog-ng restart", no other flags or options. Not sure what signal is interrupting the poll(). All other machines restart the service in a matter of seconds.

I can't figure out if this is an issue with syslog-ng, or something in the configuration of these machines. Usual Google-fu led me nowhere. Can anyone shed any light on how to troubleshoot this issue?

Thanks in advance!

EugeneRomero
  • Have you performed any resource monitoring/trending while this is going on? – EEAA Aug 28 '15 at 20:26
  • Resources seem fine. Top doesn't show unnatural cpu usage, and ram stays consistent. – EugeneRomero Aug 28 '15 at 22:02
  • I think the output you've shown is a mixture of strace output and output from the startup script. Could you try this again sending strace output to a file? What is on file descriptor 4? My first guess would be an issue with DNS resolution. – Paul Haldane Aug 29 '15 at 12:34
  • @PaulHaldane I split strace's output from the script's but the result didn't give me any new info (it's basically the above minus the "Starting syslog-ng..." bit). Extra odd, however, is the fact that, magically over the weekend, this started working. Makes no sense, hah. I know no one else has been working on these cause they're not production servers and the other 2 sys managers are on vacation. But after 2 weeks of misbehaving, it magically solved itself. I'm gonna have to do a bit of stress testing, see if I can reproduce it again. – EugeneRomero Aug 31 '15 at 15:27
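Following the suggestion above to send the strace output to a file, here is a minimal sketch of how the slow call could be isolated afterwards. The function name slow_syscalls, the log path and the PID placeholder are made up for illustration; the strace flags used in the comments (-f, -T, -o) are standard ones.

```shell
#!/bin/sh
# Capture step (run by hand, needs root); /tmp/syslog-ng.strace is a hypothetical path:
#   sudo strace -f -T -o /tmp/syslog-ng.strace service syslog-ng restart
# To see what file descriptor 4 refers to while the process hangs (PID is a placeholder):
#   sudo ls -l /proc/PID/fd/4

# slow_syscalls THRESHOLD FILE
# Print every line of an `strace -T` log whose elapsed time (the trailing
# <seconds> field) is greater than THRESHOLD seconds.
slow_syscalls() {
    awk -v limit="$1" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding angle brackets to get the number of seconds.
            secs = substr($0, RSTART + 1, RLENGTH - 2)
            if (secs + 0 > limit + 0) print
        }
    ' "$2"
}
```

Calling slow_syscalls 1 /tmp/syslog-ng.strace would list only the calls that took longer than one second, which on the affected machines should single out the long poll() on fd 4.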

1 Answer


Unfortunately, I can't help track down your original problem (maybe an AppArmor issue? Just a loose guess). But I ran into a different, related problem with syslog-ng in the past: it used to crash randomly during the night, for reasons that were very hard to track down and fix.

Ubuntu's package ships a classic SysV-style init script, which cannot restart a crashed service. I wrote a native Upstart job for this and have been using it successfully. Since it completely changes the way the daemon is started, there's a chance it will work around your problem, with automatic restart on crash as a bonus.

If you want to use it, stop the service completely (make sure it's not still running detached, and use kill -s TERM ... if it is), save the job file to /etc/init/syslog-ng.conf, and make /etc/init.d/syslog-ng a symlink to /lib/init/upstart-job. Then run sudo initctl reload-configuration, and finally service syslog-ng start.
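The steps above can be sketched as a small script. This is only an illustration under a few assumptions: the function names and the job file argument are hypothetical, pkill -TERM -x syslog-ng stands in for the manual kill -s TERM step, and the backup of the original init script is an extra precaution not mentioned above. By default it only prints what it would do.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN is 1 (the default here),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# switch_to_upstart JOB_FILE
# JOB_FILE is a hypothetical local copy of the Upstart job file.
switch_to_upstart() {
    job_src=$1
    # Stop the service through the old SysV script first.
    run service syslog-ng stop
    # Terminate any instance that is still running detached (stand-in for kill -s TERM).
    run pkill -TERM -x syslog-ng
    # Install the job file where Upstart looks for it.
    run cp "$job_src" /etc/init/syslog-ng.conf
    # Assumption: keep the original init script as a backup before replacing it.
    run mv /etc/init.d/syslog-ng /etc/init.d/syslog-ng.sysv-backup
    run ln -s /lib/init/upstart-job /etc/init.d/syslog-ng
    # Make Upstart pick up the new job, then start it.
    run initctl reload-configuration
    run service syslog-ng start
}
```

Running switch_to_upstart ./syslog-ng.conf prints the seven commands for review; the same call as root with DRY_RUN=0 set would actually execute them.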

I believe that only a lack of manpower has kept Ubuntu from shipping a proper Upstart job for almost every service.

sam_pan_mariusz
  • Thanks for your workaround. The lack of an Upstart job had been bothering me anyways, so this solves an extra problem for me :D Just one detail: when setting this up, I noticed the job would claim it had started but when checking the status, it would be "stopped/waiting". So after setting up debugging, I noticed it wouldn't start because of permission problems on /var/lib/syslog-ng/. Doing a `sudo chown syslog:syslog /var/lib/syslog-ng/` took care of that. Thanks again! I have accepted and upvoted your answer. – EugeneRomero Aug 31 '15 at 19:34
  • Good point about the dir permissions. I often add basic permission enforcement to the pre-start phase of my Upstart jobs to avoid this kind of problem. Now that I know about a possible problem with this dir, I'm going to update the job. – sam_pan_mariusz Sep 01 '15 at 06:24
  • I've updated the job file; check the current link. It includes a fix for a non-standard control socket path. – sam_pan_mariusz Sep 01 '15 at 22:00
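For readers without access to the linked job file, the ideas discussed in this thread (automatic respawn plus permission enforcement in pre-start) might look roughly like the sketch below. It is not the job file from the answer: the start/stop events, the respawn limit, the binary path /usr/sbin/syslog-ng and the helper name write_job_sketch are all assumptions, and it does not cover the control socket fix mentioned above.

```shell
#!/bin/sh
# write_job_sketch PATH
# Write an illustrative Upstart job for syslog-ng to PATH so it can be
# reviewed before being copied to /etc/init/syslog-ng.conf.
write_job_sketch() {
    cat > "$1" <<'EOF'
# syslog-ng - illustrative Upstart job (a sketch, not the job file from the answer)
description "syslog-ng system logging daemon"

start on filesystem
stop on runlevel [06]

# Restart the daemon if it exits unexpectedly, but give up after
# 5 respawns within 30 seconds.
respawn
respawn limit 5 30

pre-start script
    # Enforce ownership of the state directory before every start.
    mkdir -p /var/lib/syslog-ng
    chown syslog:syslog /var/lib/syslog-ng
end script

# Run in the foreground so Upstart can supervise the process directly.
exec /usr/sbin/syslog-ng --foreground
EOF
}
```

For example, write_job_sketch /tmp/syslog-ng.conf writes the sketch to a scratch location. Running in the foreground avoids having to tell Upstart how many times the daemon forks, which is one reason this layout was chosen for the sketch.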