I have Jenkins running a deploy script on all of our app machines. Lately, half of my builds never finish: they hang at the same step every time. The last lines of the output look like this:
** [app@app1 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app2 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app3 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app4 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app6 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
** [app@app7 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
app5, which never shows up in the output above, is always the machine that has this problem. The hang occurs when it tries to run:
/usr/local/bin/ruby /usr/local/bin/bundle exec rake db:migrate ts:conf
Production is running Ruby 1.9.3p194 and, for legacy reasons, we're still on Sphinx 0.9.8. We're also running Rails 3.2.13 and ThinkingSphinx 2.0.7.
Running strace on the hanging process shows me this:
...
29802 select(4, [3], NULL, NULL, NULL <unfinished ...>
29790 restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
29790 futex(0x64a88e8, FUTEX_WAKE_PRIVATE, 1) = 0
29790 write(4, "!", 1 <unfinished ...>
29802 <... select resumed> ) = 1 (in [3])
29790 <... write resumed> ) = 1
29802 read(3, <unfinished ...>
29790 futex(0x1d47f64, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
29802 <... read resumed> "!", 1024) = 1
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
...
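The full trace is much longer than this excerpt, so in case it helps, here is a rough way to summarize it per thread. This is only a sketch: it assumes every line starts with a thread ID followed by the syscall name, as in the excerpt above, and `tally_syscalls` is a hypothetical helper name, not part of any library.

```ruby
# Hypothetical helper: counts syscalls per thread ID in strace output.
# Assumes lines such as:
#   29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
def tally_syscalls(trace)
  counts = Hash.new { |hash, tid| hash[tid] = Hash.new(0) }
  trace.each_line do |line|
    # Capture the thread ID and the syscall name before the opening parenthesis.
    # "<... select resumed>" lines do not match, so a call split across
    # an unfinished/resumed pair is counted only once.
    match = line.match(/\A\s*(\d+)\s+([a-z_0-9]+)\(/)
    next unless match
    counts[match[1]][match[2]] += 1
  end
  counts
end

sample = <<TRACE
29802 select(4, [3], NULL, NULL, NULL <unfinished ...>
29790 futex(0x64a88e8, FUTEX_WAKE_PRIVATE, 1) = 0
29790 write(4, "!", 1 <unfinished ...>
29802 <... select resumed> ) = 1 (in [3])
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
TRACE

# Print one line per thread with its syscall counts.
tally_syscalls(sample).each do |tid, calls|
  puts "#{tid}: #{calls.inspect}"
end
```

Run against the whole trace, this should make it obvious which thread is just polling with `select` and which one is parked in `futex`.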
Has anyone seen this before? I don't have much of a sysops background, so is there a specific approach I should take to track this problem down?