
I have an Ubuntu 10.04 LTS box set up as a Chef server. This was all working fine until the first time the box was rebooted, after which the following three (possibly unrelated) things happened:

  • chef-client attempted to install updates via apt, which failed
  • The Chef webui stopped working (connection refused/timeout)
  • CouchDB and the xulrunner library it depends on stopped responding to commands: running `service couchdb stop/start/status` or `xulrunner -v` simply hangs, with nothing output or added to any logs

I believe the update problem was caused by this bug: https://bugs.launchpad.net/ubuntu/+source/xulrunner-1.9.2/+bug/680570, where updating xulrunner causes a hang. I was able to get around this by restoring the box from an earlier backup (which we'll call backup A), stopping all the Chef processes and CouchDB, installing xulrunner-dev, installing all remaining updates, and then starting everything up again. At this point Chef and Couch both appeared to be working fine. I took a backup of the box in this 'working' state, which we'll call backup B.
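The order of operations in that workaround can be sketched as a small script. This is only a rough illustration: the service names in `SERVICES` are assumptions about what a Chef server of this era runs (adjust them to whatever is actually installed), and `run` and `apply_workaround` are hypothetical helper names. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

# Assumed service names, not taken from the original post.
SERVICES="chef-client chef-server-webui chef-server chef-solr couchdb"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

apply_workaround() {
    # 1. Stop every Chef process and CouchDB before touching xulrunner.
    reversed=""
    for svc in $SERVICES; do
        run service "$svc" stop
        reversed="$svc $reversed"
    done

    # 2. Install xulrunner-dev, then the remaining updates.
    run apt-get install -y xulrunner-dev
    run apt-get upgrade -y

    # 3. Start everything again, in the reverse of the stop order.
    for svc in $reversed; do
        run service "$svc" start
    done
}
```

Calling `apply_workaround` prints the plan; `DRY_RUN=0 apply_workaround` as root would execute it.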

However, although the box appeared to be working, attempting to run status/restart/stop via `service couchdb` caused a hang again, with no output. When I rebooted the box CouchDB didn't start, and again, `service couchdb start` just hangs. I then restored the box from backup B, but when it boots CouchDB does not start and shows the same issues. Nothing is added to the couchdb log file, or output if I run the command manually.

In its current state I have:

  • CouchDB: 0.10.0-1ubuntu2
  • xulrunner: 1.9.2.24+build2+nobinonly-0ubuntu0.10.04.1

If I run strace /usr/bin/couchdb the last few lines output are:

stat("/var/lib/couchdb", {st_mode=S_IFDIR|0770, st_size=4096, ...}) = 0
stat(".", {st_mode=S_IFDIR|0770, st_size=4096, ...}) = 0
open("/usr/bin/couchdb", O_RDONLY)      = 3
fcntl(3, F_DUPFD, 10)                   = 10
close(3)                                = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, {0x408189, ~[RTMIN RT_1], SA_RESTORER, 0x7f2a7ba7caf0}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x7f2a7ba7caf0}, NULL, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL, ~[RTMIN RT_1], SA_RESTORER, 0x7f2a7ba7caf0}, NULL, 8) = 0
read(10, "#! /bin/sh -e\n\n# Licensed under "..., 8192) = 8192
pipe([3, 4])                            = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,  child_tidptr=0x7f2a7c2069d0) = 1463
close(4)                                = 0
read(3,

...and then it hangs.

If I run strace xulrunner --gre-version the last few lines of output are:

open("/proc/cpuinfo", O_RDONLY)         = 3
mmap(NULL, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =     0x7feeee879000
open("/etc/ld.so.cache", O_RDONLY)      = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=33168, ...}) = 0
mmap(NULL, 33168, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7feeee84b000
munmap(0x7feeee84b000, 33168)           = 0
close(4)                                = 0
futex(0x7feeec0760ec, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x7feeeea980a0, FUTEX_WAIT_PRIVATE, 2, NULL

...and then it hangs.
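Because both commands block indefinitely, it can help to wrap them in a time limit when checking whether a given change fixed anything. A minimal sketch, assuming GNU coreutils `timeout` is available (it exits with status 124 when the limit is hit); `hangs` is a hypothetical helper name:

```shell
#!/bin/sh
# hangs LIMIT CMD [ARGS...]
# Succeeds (returns 0) only if CMD was still running after LIMIT seconds
# and had to be killed by timeout, i.e. it appears to have hung.
hangs() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    [ "$status" -eq 124 ]
}
```

For example, `hangs 10 xulrunner --gre-version && echo "xulrunner is still hanging"` gives a quick yes/no answer without tying up the terminal.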

I have also tried:

  • Setting up an ldconfig file as described here: http://wiki.apache.org/couchdb/Installing_on_Ubuntu
  • Adding the backports repos and attempting to install the later version of CouchDB (fails as the update process tries to restart couchdb, which hangs)
  • Restoring from backup A and preventing xulrunner from updating by putting a 'hold' on the package
  • Reinstalling xulrunner via apt (fails because the reinstall process hangs)
  • Changing the couch config files to increase log level to 'debug' - still no output
  • Ensuring all the permissions and ownerships for all of the couch directories are set appropriately
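For reference, the package 'hold' mentioned in the list above can be set through dpkg's selection states. This is a sketch only: the package name `xulrunner-1.9.2` is an assumption taken from the source package named in the bug URL, and `hold_selection` is a hypothetical helper that just builds the line dpkg expects.

```shell
#!/bin/sh
# Print a dpkg selection line that marks PACKAGE as held.
hold_selection() {
    printf '%s hold\n' "$1"
}

# Usage (as root):
#   hold_selection xulrunner-1.9.2 | dpkg --set-selections
# Check the result:
#   dpkg --get-selections xulrunner-1.9.2
```

Newer apt releases also provide `apt-mark hold`, which may not be available in the apt version shipped with this Ubuntu release.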

Any help appreciated.

Tim Fountain

2 Answers


I had a weird chef-server issue a few months ago that was resolved by checking permissions on /var/log/chef and /var/run/chef or something along those lines, to make sure the various chef processes could actually write in those directories. There was an apparent hang for some minutes on service start, and then the thing failed silently.

cjc
  • Thanks for the response. I found an Ubuntu bug report related to this, so I have actually tried this already (I've edited my question to include that info). I think it's the xulrunner hang that is causing the couchdb hang, and presumably xulrunner doesn't need to write anything. I also have the same problems when I run either process as root, which possibly rules out any permissions-related issues? – Tim Fountain Dec 05 '11 at 15:45
  • No idea, then. I'm actually setting up a new chef server (Lucid, installed from Opscode apt), and just did a reboot and it seemed fine. I guess the next step is to contact the more specialized forums, for chef or couch. – cjc Dec 05 '11 at 16:08
  • On your server could you try installing any apt updates, and then rebooting? And assuming everything's fine after that, run `xulrunner --gre-version` and post the output? That at least would tell me whether with the versions of everything that I have it should be possible for it to work. – Tim Fountain Dec 05 '11 at 18:50
  • `$ sudo apt-get upgrade` printed: Reading package lists... Done / Building dependency tree / Reading state information... Done / The following packages have been kept back: linux-image-ec2 / 0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded. `$ xulrunner --gre-version` printed: 1.9.2.24 – cjc Dec 05 '11 at 19:59

I eventually worked out that it was another package on this system causing a conflict with xulrunner. Rebuilding the box minus the conflicting package solved the problem.

Tim Fountain
  • Which package was that? Just in case I have similar issues. – cjc Dec 06 '11 at 22:28
  • appfirst, it's a third party monitoring thing we are experimenting with – Tim Fountain Dec 07 '11 at 11:36
  • Um, you should be able to remove the package without rebuilding the entire box, just saying. Try `apt-get remove appfirst` as root, that should do the trick. – Avery Payne Dec 08 '11 at 22:56
  • Obviously I did try that first. I'm not sure why, but that package seems to corrupt xulrunner somehow. Restoring from the backup did not take long, so this was an easy solution. – Tim Fountain Dec 11 '11 at 15:47
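The thread does not say how the package interfered with xulrunner. One generic thing to check when a third-party package makes unrelated binaries hang is whether anything has been added to `/etc/ld.so.preload`, since libraries listed there are loaded into every dynamically linked process. The sketch below is an assumption about where to look, not a confirmed cause; `list_preloads` is a hypothetical helper name.

```shell
#!/bin/sh
# Print one preloaded library path per line from FILE
# (default: /etc/ld.so.preload). Prints nothing if the file is absent.
list_preloads() {
    file="${1:-/etc/ld.so.preload}"
    [ -f "$file" ] || return 0
    # Drop comments, split entries on whitespace or colons, drop blank lines.
    sed 's/#.*//' "$file" | tr ' \t:' '\n\n\n' | sed '/^$/d'
}

# Usage: show which package, if any, owns each preloaded library.
#   for lib in $(list_preloads); do dpkg -S "$lib"; done
```

An empty result means nothing is being preloaded system-wide, in which case the conflict lies elsewhere.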