How to troubleshoot sporadic crash due to garbage collection in Ruby

Question

I have a Ruby v2.3.4 application based on Grape v0.19.2.

Recently, after our last deployment, we noticed that the system shut down and our god v0.13.7 process monitor started it back up again. After looking at the crash logs, we're seeing 20-30 crashes a week.

Here are some sample crash reports:

/.rvm/gems/ruby-2.3.4/gems/bson-4.2.1/lib/bson/hash.rb:80: [BUG] rb_gc_mark(): 0x007fa2f4fb33f0 is T_NONE
/.rvm/gems/ruby-2.3.4/gems/mongo-2.4.1/lib/mongo/socket.rb:176: [BUG] rb_gc_mark(): 0x007f990c383360 is T_NONE
/.rvm/gems/ruby-2.3.4/gems/activesupport-5.1.1/lib/active_support/callbacks.rb:102: [BUG] rb_gc_mark(): 0x007ffbeb9e3880 is T_NONE

These crashes seem to happen randomly and can be 5-7 days apart or several will happen in an hour. The stacktraces in the crash logs aren't very helpful and show basically everything we're running.

Currently our strategy has been to roll back our entire code base and look at all the changes that went in, but they are very numerous. In addition, 30-40 of our gem dependencies were updated. Since the crashing appears to be random, it's very difficult to test whether a change to the code or a gem has fixed the issue.

This issue appears to be garbage collection related, so I tried using GC in debug mode to see if that could help us create a reproducible case, but the application took orders of magnitude longer to start up and run, so that strategy wasn't viable.

What would be a good strategy to force a crash so we can narrow down whether the problem came from our code update or a dependent gem?

Seriously, with a 2.3.4 app, you might just be encountering a fixed bug or security flaw. You might better invest your time updating your Rails version, first (I know this can be monumental). Not an answer, but food for thought... Good luck! — Brad Werth, Jun 05 '17 at 16:24
Yes, we plan on bumping ruby to the latest. We aren't running Rails though, but we are going to try bumping all dependencies to the latest as a "shotgun" approach. — arcdegree, Jun 05 '17 at 16:32
Sorry, I completely misread that... I would have less hope for the Ruby update, but you never know, probably worth a shot... Gotta love the unreproducible stuff... — Brad Werth, Jun 05 '17 at 16:51
We are in the process, but it's going to take quite a bit of time since there are hundreds of gems and lots of testing that needs to take place. I'll report my findings when it's complete. — arcdegree, Jun 05 '17 at 20:15
Updating to 2.4.1 did not help. Application still crashes. We're looking into reverting everything at this point. — arcdegree, Jun 13 '17 at 18:28

score 0 · Answer 1 · answered Jun 14 '17 at 04:54

I haven't debugged this sort of problem in Ruby, but I can give some general advice. As you've discovered, Ruby's mark and sweep garbage collection can be unpredictable. Memory issues are frustrating to debug at the best of times, and even harder when you can't reliably reproduce them. Fortunately, there are some things you can do to dig into the problem.

To start, garbage collection bugs are often associated with large memory allocations. Either the GC is getting tripped up by many objects or there's a bug that leaks memory. In either case, use GC.stat to gather information about the memory state at various points in processing. If you see memory allocation growing, you can start to narrow down the range of possible problems with a memory profiler such as this one. If you can find where memory is getting eaten, you have a place to start. The fix might be to avoid pulling in a gem that's causing problems, or to change the way you store data.
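As a rough illustration of sampling GC.stat around a piece of work, here is a minimal sketch. The helper name gc_delta and the choice of keys are my own assumptions for illustration; pick whichever GC.stat counters matter for your application.

```ruby
# Hypothetical helper: gc_delta and GC_KEYS are illustrative names, not part of any gem.
GC_KEYS = %i[count major_gc_count heap_live_slots total_allocated_objects].freeze

# Runs the block and reports how the selected GC.stat counters changed.
def gc_delta(label)
  before = GC.stat
  result = yield
  after = GC.stat
  delta = GC_KEYS.each_with_object({}) { |key, h| h[key] = after[key] - before[key] }
  warn "[gc] #{label}: #{delta.inspect}"
  [result, delta]
end

# Example: wrap a suspicious code path (here, just allocating a lot of strings).
result, delta = gc_delta("build strings") { Array.new(10_000) { |i| "item #{i}" } }
```

Wrapping request handlers or background jobs this way and logging the deltas can show which code paths allocate heavily before a crash, which gives the memory profiler a smaller area to search.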

Next, consider adjusting garbage collection parameters. This won't help you find the cause of the bug, but it could prevent it from occurring. A fairly extensive survey of the output of GC.stat and the meaning of the environment variables that can be adjusted may be found in this post.
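The tuning variables (for example RUBY_GC_HEAP_GROWTH_FACTOR or RUBY_GC_MALLOC_LIMIT) are read when the interpreter starts, so it can help to log which ones are actually in effect for a given process. A small sketch, assuming a hypothetical helper named gc_tuning:

```ruby
# Hypothetical helper: returns the RUBY_GC_* variables present in the environment.
# Pass a plain Hash instead of ENV to try it out without touching the real environment.
def gc_tuning(env = ENV)
  env.to_h.select { |name, _value| name.start_with?("RUBY_GC_") }
end

# Log once at boot so each crash log can be matched to the settings that were active.
warn "[gc] tuning: #{gc_tuning.inspect}"
```

If you experiment with several settings across deployments, having this line in the log makes it easier to tell which combination a crashing process was running with.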

It can also help to manually initiate GC. If there's a point in processing where a little slowdown won't hurt, collect garbage there to keep the heap at a manageable level. Obviously, this could mask the underlying problem and slow down your application, so use it with caution. But if you can't find the root cause of the error, it's better to avoid crashing, and forcing collections at known points might even make the crash more reproducible.
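One way to do this in a Rack-based application such as Grape is a small middleware that calls GC.start after every N requests. This is only a sketch: the class name PeriodicGC and the interval are assumptions, and the stand-in app below is a lambda rather than your real API.

```ruby
# Hypothetical Rack-style middleware; PeriodicGC and the `every` option are illustrative.
class PeriodicGC
  def initialize(app, every: 100)
    @app = app
    @every = every
    @count = 0
    @lock = Mutex.new
  end

  def call(env)
    response = @app.call(env)
    # Count requests under a lock so threaded servers don't skip or double-count.
    run_gc = @lock.synchronize { ((@count += 1) % @every).zero? }
    GC.start if run_gc # full collection after the response has been built
    response
  end
end

# Stand-in app for demonstration: collects after every 2nd request.
app = PeriodicGC.new(->(_env) { [200, {}, ["ok"]] }, every: 2)
gc_runs_before = GC.count
4.times { app.call({}) }
```

Because the collection now happens at a known point, a crash inside GC.start will at least show up at a predictable place in the stack trace, which may help correlate it with the request that preceded it.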

Finally, you might try an alternative malloc implementation. If you don't mind building Ruby from source, it's very easy to swap out the default malloc provided by your system's C library and try something else. Again, this might only mask the underlying problem.
