
I am running rake test in a Docker container whose maximum memory I need to limit. This is done by passing the -m 1800m flag to the docker run command. Unfortunately, the rake process's memory usage keeps growing, and the process is eventually killed mid-run by what appears to be the OOM killer. I've tried putting the following in my test_helper.rb file...

class ActiveSupport::TestCase
  teardown :perform_gc

  def perform_gc
    puts "==> Before GC: #{ObjectSpace.count_objects}"
    GC.start(full_mark: true)
    puts "==> After GC: #{ObjectSpace.count_objects}"
  end
end

...and am getting this output during my test run.

==> Before GC: {:TOTAL=>1067513, :FREE=>798, :T_OBJECT=>146231, :T_CLASS=>13450, :T_MODULE=>3350, :T_FLOAT=>10, :T_STRING=>448040, :T_REGEXP=>3951, :T_ARRAY=>156744, :T_HASH=>32722, :T_STRUCT=>2162, :T_BIGNUM=>8, :T_FILE=>266, :T_DATA=>113834, :T_MATCH=>20339, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>115807, :T_ICLASS=>9741}
==> After GC: {:TOTAL=>1067920, :FREE=>304019, :T_OBJECT=>92774, :T_CLASS=>13431, :T_MODULE=>3350, :T_FLOAT=>10, :T_STRING=>328707, :T_REGEXP=>3751, :T_ARRAY=>107523, :T_HASH=>25206, :T_STRUCT=>2023, :T_BIGNUM=>7, :T_FILE=>11, :T_DATA=>112605, :T_MATCH=>11, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>64713, :T_ICLASS=>9719}

... test result of test #1 ....

==> Before GC: {:TOTAL=>1598233, :FREE=>338182, :T_OBJECT=>111209, :T_CLASS=>15057, :T_MODULE=>3481, :T_FLOAT=>10, :T_STRING=>570289, :T_REGEXP=>4836, :T_ARRAY=>219746, :T_HASH=>54358, :T_STRUCT=>12047, :T_BIGNUM=>8, :T_FILE=>12, :T_DATA=>138031, :T_MATCH=>2600, :T_COMPLEX=>1, :T_RATIONAL=>389, :T_NODE=>117993, :T_ICLASS=>9984}
==> After GC: {:TOTAL=>1598233, :FREE=>653201, :T_OBJECT=>103708, :T_CLASS=>14275, :T_MODULE=>3426, :T_FLOAT=>10, :T_STRING=>418825, :T_REGEXP=>3773, :T_ARRAY=>137405, :T_HASH=>39734, :T_STRUCT=>7444, :T_BIGNUM=>7, :T_FILE=>12, :T_DATA=>128923, :T_MATCH=>12, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>77590, :T_ICLASS=>9828}

... test result of test #2 ....


==> Before GC: {:TOTAL=>1598233, :FREE=>269630, :T_OBJECT=>114406, :T_CLASS=>14815, :T_MODULE=>3470, :T_FLOAT=>10, :T_STRING=>611637, :T_REGEXP=>4352, :T_ARRAY=>248693, :T_HASH=>58757, :T_STRUCT=>12208, :T_BIGNUM=>8, :T_FILE=>25, :T_DATA=>139671, :T_MATCH=>2288, :T_COMPLEX=>1, :T_RATIONAL=>83, :T_NODE=>108278, :T_ICLASS=>9901}
==> After GC: {:TOTAL=>1598233, :FREE=>635044, :T_OBJECT=>105028, :T_CLASS=>14358, :T_MODULE=>3427, :T_FLOAT=>10, :T_STRING=>429137, :T_REGEXP=>3775, :T_ARRAY=>140654, :T_HASH=>41626, :T_STRUCT=>8085, :T_BIGNUM=>7, :T_FILE=>12, :T_DATA=>129507, :T_MATCH=>15, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>77631, :T_ICLASS=>9857}

... test result of test #3 ....

... and so on ....

The value of ObjectSpace.count_objects[:TOTAL] is constantly growing after each test!

1067920 (after GC is run at the end of the 1st test)
5250321 (after GC is run at the end of the 18th test)
8631313 (after GC is run at the end of the last-but-ten test)
8631313 (this number remains the same for the next 10 tests)
8631313 (after GC is run at the end of the last-but-three test)
8631313 (after GC is run at the end of the last-but-two test)
8631313 (after GC is run at the end of the last-but-one test)
8631721 (after GC is run at the end of the last test, after which rake aborts)
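In case it helps narrow things down, I could also log heap-level counters from GC.stat in the same teardown. This is a sketch I haven't wired into the suite yet; it should show whether Ruby keeps acquiring new heap pages from the OS even while the object counts above look stable.

```ruby
# Sketch: GC.stat exposes heap-level counters. If heap_allocated_pages
# keeps climbing across tests, the Ruby heap itself is growing, not
# just the count of live objects.
def log_heap_stats
  s = GC.stat
  puts "==> heap pages: #{s[:heap_allocated_pages]}, " \
       "live slots: #{s[:heap_live_slots]}, " \
       "old objects: #{s[:old_objects]}"
end

log_heap_stats
```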

I'm also monitoring the process's memory usage via docker stats and ps aux --sort -rss. Memory consumption stabilises around the 1.77 GB mark, which is dangerously close to the 1800 MB limit set for the container. This is corroborated by the value of ObjectSpace.count_objects[:TOTAL] not changing for the last 10-15 tests before rake is killed/aborted.
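To correlate the OS-level numbers with the Ruby-level ones, I'm considering logging the process's RSS from inside the teardown as well. A minimal sketch (Linux only, assumes /proc is mounted, which it is inside the container):

```ruby
# Sketch: read the process's resident set size from /proc so it can be
# printed alongside ObjectSpace.count_objects in the same teardown.
def rss_mb
  # The VmRSS line in /proc/<pid>/status is reported in kB
  kb = File.read("/proc/#{Process.pid}/status")[/^VmRSS:\s+(\d+)/, 1]
  kb.to_i / 1024.0
end

puts format('==> RSS: %.1f MB', rss_mb)
```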

The error messages in docker logs at the time of the crash/abort/kill are:

PANIC:  could not write to log file 00000001000000000000000A at offset 14819328, length 16384: Cannot allocate memory
LOG:  unexpected EOF on client connection with an open transaction
LOG:  WAL writer process (PID 262) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2018-02-26 06:35:38 UTC

How do I get to the bottom of this and ensure that my tests run in constant memory?
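One avenue I'm considering is Ruby's bundled objspace extension, which can record where each object was allocated. A rough sketch (not yet tried against this suite; the `leaky` array is a hypothetical stand-in for a suspect test body):

```ruby
require 'objspace'

# Sketch: trace allocations around a suspect block, force a GC, then
# report where the surviving objects were allocated. Tracing is slow,
# so this is for debugging runs only.
ObjectSpace.trace_object_allocations_start
leaky = 10.times.map { 'x' * 1000 } # stand-in for a suspect test body
GC.start
ObjectSpace.trace_object_allocations_stop

leaky.first(3).each do |obj|
  puts "#{ObjectSpace.allocation_sourcefile(obj)}:" \
       "#{ObjectSpace.allocation_sourceline(obj)}"
end
```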

Saurabh Nanda
  • Might be a case of https://stackoverflow.com/questions/20385767/finding-the-cause-of-a-memory-leak-in-ruby/20608455#20608455 – Aleksei Matiushkin Feb 26 '18 at 07:07
  • Are the tests ordered, or do they run in random order? Are you able to replicate the issue with certainty using a particular order? This might actually require reading through the code line by line. – Swanand Feb 26 '18 at 07:13
  • The tests are ordered. I am actually splitting the test files into N chunks and running N docker containers, one per chunk of test files (hence the requirement to limit the docker container's max-memory). Every container consistently grows past the 1.8 GB limit. – Saurabh Nanda Feb 26 '18 at 07:15
  • As another data point, I am monitoring the behaviour of the rake process in CircleCI as well (the original question is about a custom build pipeline on Google Cloud). Even on CircleCI, the rake process continuously keeps growing in memory. Since I've been monitoring it, it has grown from ~600 MB to ~1,400 MB! – Saurabh Nanda Feb 26 '18 at 07:17
  • @mudasobwa even if Ruby doesn't release memory back to the OS, it keeps an internal track of what is being used and what is actually available for reallocation, right? Why does it constantly keep acquiring more memory from the OS after every test? – Saurabh Nanda Feb 26 '18 at 07:27
  • A couple of question that should help you get at the cause, you might have already considered some or all of these: 1. Is the object growth linear? Anything that jumps out of pattern? 2. Does enabling stress mode help? `GC.stress = true` ( Major performance degradation, but helps debugging ) 3. What is the database cleaner strategy? Is there any problem there? 4. Are you using @instance variables in tests? Any shared state? – Swanand Feb 26 '18 at 07:36
  • There are tonnes of `@instance` variables in tests. The setup block in all the test files create a bunch of `@instance` variables. Won't they be cleared after the test is over? – Saurabh Nanda Feb 26 '18 at 07:39
  • Yeah, `@instance` variables in the setup shouldn't be an issue. However, I would make sure that the methods you are using to build those variables are not memoizing the data, or storing that data for multiple tests. A type of variable to watch out for are class variables. They can either defined like `@@name` within an instance or `@name` within the class. – Cody Gustafson Feb 26 '18 at 16:35
  • Is there any gem / command-line tool which can help me locate this memory leak in my code? – Saurabh Nanda Feb 27 '18 at 04:08

0 Answers