A few days ago all 4 of my app servers started having issues. It began after I deployed some code, but all I did was update a local database file that stores some IP addresses, so I didn't make any actual code changes. Right around that time, my ruby processes started getting out of hand. They will be fine for a while, then all of a sudden they quickly climb to 100% CPU on one core. Since I'm using Passenger, eventually another one will do the same thing and max out another core, and so on until the web server can no longer handle traffic and stops responding.

I've done a lot of digging (which I am not great at), and I've at least found that when I run strace on the processes, they look pretty normal to start. Then, when they go crazy as described above, the output is just a non-stop flood of clock_gettime(CLOCK_REALTIME, {1518938625, 9566131}) = 0 calls. A normal process is not constantly spitting out system calls; it only does so when a web request comes in, for example. But then something sets it off and it just goes nuts until I kill the process, restart Passenger, or reboot the server. Then within an hour or two it's back to having issues again.
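
For illustration only, here is a minimal Ruby sketch of one pattern that could produce a trace like this: a loop that keeps reading the clock without ever sleeping. The helper name `busy_wait_until` is hypothetical, and this assumes the time lookups come from Ruby-level code; they could just as well originate in the interpreter or a C extension.

```ruby
# Hypothetical helper: spins until `deadline`, reading the clock on every pass.
# Returns how many times the clock was polled.
def busy_wait_until(deadline)
  polls = 0
  # No sleep inside the loop, so each pass calls Time.now and one core stays busy.
  polls += 1 while Time.now < deadline
  polls
end

# Spin for roughly 50 ms and report how often the clock was read.
polls = busy_wait_until(Time.now + 0.05)
puts "polled the clock #{polls} times in about 50 ms"
```

If the deadline were never reached (for example, because it is recomputed on every pass), a loop of this shape would run forever and show up as a steady stream of time lookups.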

I've been at it for a couple of days, babysitting the servers and restarting things non-stop to keep them limping along, and I am desperate for some ideas. I've noticed a couple of really old posts from around 2013 that talk about a 100% CPU issue involving clock_gettime, and I've tried both suggestions from the few posts I've seen. One is setting a TZ variable, and the other is supposed to fix a leap second bug of some sort. I don't understand the reasoning behind either of the proposed fixes, and sadly neither worked.

I am running the following stack:

  • ruby 2.2.0
  • Passenger standalone, gem version 4.0.58 (I tried upgrading to 5.2.0 on one server with no change in behavior)
  • MySQL
  • CentOS 6.9

Sean
  • Take care of the obvious stuff first: Remember that `strace` only shows system calls, but your program might be in an infinite loop; use `ltrace` for possibly more detail. Get rid of passenger and use a proper rails web server like unicorn, puma or thin. Upgrade to CentOS 7. – Michael Hampton Feb 18 '18 at 07:50
  • Thanks, I hadn't heard of ltrace before. It's not really showing much. While strace is showing all those clock_gettime calls, ltrace showed just a couple of lines like this: +++ exited (status 0) +++ Should it be able to show actual application code? I am in the process of building and testing a new stack with nginx and puma. It's going to take some time though, so I would love to solve this issue in the meantime. Thanks! – Sean Feb 18 '18 at 09:55
  • Neither will trace within your application code. For that you'll want something Rails-specific. And that's a bit outside our scope here. Check [so] or Google. – Michael Hampton Feb 18 '18 at 19:14
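
As a rough idea of what tracing inside the application could look like, the sketch below prints the Ruby backtrace of every live thread when the process receives a signal. This is only a minimal sketch: the method names `dump_thread_backtraces` and `install_backtrace_dump` are hypothetical, and the choice of SIGTTIN is an assumption, so pick a signal that nothing else in the stack already uses.

```ruby
# Write the current backtrace of every live thread to `io`.
def dump_thread_backtraces(io = $stderr)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} (status: #{thread.status.inspect})"
    # backtrace can be nil if the thread died after Thread.list was read.
    (thread.backtrace || ["<no backtrace available>"]).each do |frame|
      io.puts "    #{frame}"
    end
  end
  io.flush
end

# Hypothetical hook: dump backtraces when the process receives `signal`.
# The default of TTIN is an assumption, not something the stack requires.
def install_backtrace_dump(signal = "TTIN")
  Signal.trap(signal) do
    # Do the work in a fresh thread, since trap handlers cannot take locks.
    Thread.new { dump_thread_backtraces }
  end
end
```

With `install_backtrace_dump` called once at boot (for example from an initializer), sending that signal to a spinning worker with `kill -TTIN <pid>` should write the backtraces to wherever the process's stderr ends up, which may point at the Ruby code that is looping.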
