
I have Jenkins running a deploy script on all of our app machines. Lately, about half of my builds never finish: they hang at the same step every time. The last of the output looks like this:

 ** [app@app1 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
 ** [app@app2 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
 ** [app@app3 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
 ** [app@app4 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
 ** [app@app6 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf
 ** [app@app7 :: stdout] Generating Configuration to /var/www/app/releases/20130509192657/config/production.sphinx.conf

app5 is always the machine that seems to have this problem, and it occurs when it tries to run:

/usr/local/bin/ruby /usr/local/bin/bundle exec rake db:migrate ts:conf

Production is running Ruby 1.9.3p194, and for legacy reasons we're still running Sphinx 0.9.8. We're also running Rails 3.2.13 and ThinkingSphinx 2.0.7.

Running strace on the hanging process shows me this:

...
29802 select(4, [3], NULL, NULL, NULL <unfinished ...>
29790 restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
29790 futex(0x64a88e8, FUTEX_WAKE_PRIVATE, 1) = 0
29790 write(4, "!", 1 <unfinished ...>
29802 <... select resumed> )            = 1 (in [3])
29790 <... write resumed> )             = 1
29802 read(3,  <unfinished ...>
29790 futex(0x1d47f64, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
29802 <... read resumed> "!", 1024)     = 1
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
29802 select(4, [3], NULL, NULL, {0, 100000}) = 0 (Timeout)
...

Has anyone seen this before? I don't have much of a sysops background; is there a specific approach I should take to tackle this problem?

Eric R.
  • Just to be clear: are you using Sphinx 0.9.8? If so, which version of Thinking Sphinx are you using? And which version of Rails? – pat May 09 '13 at 20:12
  • Yep. ThinkingSphinx 2.0.7 and Rails 3.2.13. – Eric R. May 09 '13 at 20:17
  • I'm not sure it's related to Thinking Sphinx or not, given you're not seeing any output from app5. Can you run the rake task manually on app5 and when it hangs, hit control-C and see where the stack trace is at? – pat May 09 '13 at 20:22
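If hitting control-C at the right moment is awkward (for example, when the task is launched by the deploy script rather than from a terminal), one possible alternative is to have the process print its backtraces on a signal. This is only a sketch: the helper name `dump_thread_backtraces` and the choice of `SIGUSR1` are illustrative assumptions, not part of Rake or Thinking Sphinx, and the snippet would have to be loaded early (for instance from the Rakefile) to be of any use.

```ruby
require "stringio"

# Hypothetical helper: write the current backtrace of every live thread to +io+.
# Thread#backtrace can return nil for a thread that is no longer alive.
def dump_thread_backtraces(io = $stderr)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} (status: #{thread.status.inspect})"
    (thread.backtrace || ["<no backtrace available>"]).each do |frame|
      io.puts "    #{frame}"
    end
  end
  io
end

# Assumed wiring: dump all backtraces when the process receives SIGUSR1.
# The guard skips platforms that do not define that signal.
if Signal.list.key?("USR1")
  Signal.trap("USR1") { dump_thread_backtraces($stderr) }
end
```

With this loaded, `kill -USR1 <pid>` against the hanging rake process on app5 should print where each Ruby thread is currently waiting, without stopping the process.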

1 Answer


If db:migrate is hanging, an active or hung (perhaps zombie) process may be holding a lock on a database table, or some other resource, that the migration references. I ran into this recently: a data fix-up script run by another engineer had crashed over a week before the deployment I was attempting, but had not exited, and it was holding an open transaction that prevented any alteration to a table. The fix for us was simply to terminate the stuck process, after which the migration ran normally.

Without knowing your system architecture, it's difficult to be precise about what the conflicting resource could be. Your RDBMS tooling may let you look at the databases hosted on the server and see what the open connections are doing.
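As a rough illustration of what looking at open connections could mean in practice, here is a sketch that assumes the database is MySQL (the question does not say which RDBMS is in use). It filters rows shaped like the output of `SHOW FULL PROCESSLIST` for connections that have been in their current state for a long time. The helper name `long_running_connections`, the 300-second threshold, and the sample rows are all made up for the example.

```ruby
# Hypothetical helper: keep connections whose "Time" column (seconds in the
# current state, as reported by MySQL's SHOW FULL PROCESSLIST) is at least
# +min_seconds+, longest first.
def long_running_connections(rows, min_seconds = 300)
  rows.select { |row| row["Time"].to_i >= min_seconds }
      .sort_by { |row| -row["Time"].to_i }
end

# Made-up sample rows, only to show the expected shape of the data.
sample_rows = [
  { "Id" => 101, "User" => "app",   "Command" => "Query", "Time" => 2,      "Info" => "SELECT 1" },
  { "Id" => 57,  "User" => "fixup", "Command" => "Sleep", "Time" => 650000, "Info" => nil },
  { "Id" => 88,  "User" => "app",   "Command" => "Sleep", "Time" => 400,    "Info" => nil }
]

long_running_connections(sample_rows).each do |row|
  puts row.values_at("Id", "User", "Command", "Time").inspect
end

# Assumed usage from a Rails console on a MySQL-backed app:
#   rows = ActiveRecord::Base.connection.select_all("SHOW FULL PROCESSLIST").to_a
#   long_running_connections(rows)
```

A connection that has been sleeping for days is only a hint, not proof, that it holds an open transaction; it does give a concrete process to investigate before terminating anything.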

Neil Slater