
From a data supplier I download roughly 75 images plus 40 pages of details in a single job using RestClient.

The flow goes like this:

  1. Authenticate to the supplier's service and store the cookie jar in a variable.
  2. Download the XML index.
  3. The XML contains roughly 40 assets.
  4. For each asset, download its list of images (the list spans 0-10 images per asset).
  5. Download the images.
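In outline, the code does something like the following (the full version is in the gists linked below). The endpoint URLs are placeholders, and the `image_urls_for` helper with its `<image url="..."/>` XML shape is a simplified stand-in for the supplier's real schema:

```ruby
require "rexml/document"  # stdlib XML parser, standing in for the real parsing setup

# Simplified stand-in: pull image URLs out of one downloaded XML document,
# assuming one <image url="..."/> element per image.
def image_urls_for(xml)
  REXML::Document.new(xml).get_elements("//image").map { |img| img.attributes["url"] }
end

# The overall flow, with the RestClient calls shown as comments because the
# supplier endpoints here are placeholders:
#
# cookies = RestClient.post("https://supplier.example/login", user: u, pass: p).cookie_jar
# xml     = RestClient.get("https://supplier.example/assets.xml", cookies: cookies).body
# image_urls_for(xml).each do |url|
#   image = RestClient.get(url, cookies: cookies).body  # whole body buffered in memory
# end
```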

My total download size is 148.14 MB in 37.58 seconds across 115 unique requests. My memory consumption is:

Total allocated: 1165532095 bytes (295682 objects)
Total retained:  43483 bytes (212 objects)

measured with the memory_profiler gem. That's just over 1 GB of memory allocated to download ~150 MB of data.
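For reference, allocation counts can also be cross-checked without any gem via the stdlib's GC.stat; the array below is just a stand-in for downloaded response bodies:

```ruby
# Count object allocations around a piece of work using only the stdlib.
before = GC.stat(:total_allocated_objects)

payload = Array.new(10_000) { "x" * 64 }  # stand-in for downloaded response bodies

allocated = GC.stat(:total_allocated_objects) - before
```

If this delta tracks what memory_profiler reports, the allocations are real rather than a profiling artifact.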

My big concern is that I need to download even more data: this is just 1 out of 15 days of data. Running 2 days of data doubles both the download size and the memory use; running 3 days triples them, and so on. The memory consumption even appears to grow exponentially until I run out of memory and my server crashes.

Why is garbage collection not kicking in here? I've tried running GC.start between each day of data I download; that improves the numbers memory_profiler reports, but my server still ends up crashing when I add too many days of data.
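For reference, a quick stdlib check does confirm that data whose references are gone is collected after GC.start (the array below is a stand-in for one day's payload):

```ruby
# Simulate one day's payload held in a local variable, then dropped.
def one_day
  data = Array.new(50_000) { "x" * 100 }  # stand-in for downloaded data
  data.size                               # use it so the work isn't trivially dead
end

freed_before = GC.stat(:total_freed_objects)
one_day                 # all 50,000 strings are unreachable once this returns
GC.start                # full mark-and-sweep
freed = GC.stat(:total_freed_objects) - freed_before
```

Note that reclaimed objects don't always shrink the process's resident memory: Ruby may keep the freed heap pages around, and fragmentation can prevent returning them to the OS, which is one way a process keeps growing even while the GC is working.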

So my questions are:

  1. Why is the memory consumption so high compared to the amount of data I'm actually downloading?
  2. Since I overwrite the variables holding the downloaded data between downloads, shouldn't garbage collection clear the memory used by the previous download?
  3. Any tips and tricks to keep memory consumption down?

Versions: Ruby: 2.4.4p296, RestClient: 2.0.2, OS: Ubuntu 16.04

Example code:

Using RestClient: https://gist.github.com/mtrolle/96f55822122ecabd3cc46190a6dc18a5

Using HTTParty: https://gist.github.com/mtrolle/dbd2cdf70f77a83b4178971aa79b6292

Thanks

mtrolle
  • This is my example code: https://gist.github.com/mtrolle/96f55822122ecabd3cc46190a6dc18a5 – mtrolle Nov 06 '18 at 10:12
  • when you say 'one job' what are you referring to? rake task? active job? – Anthony Nov 06 '18 at 15:08
  • It's actually one run. Normally I would execute this via Rails' ActiveJob, but I'm reproducing the memory issue in the standalone file linked above. As mentioned, this is when retrieving just 1 out of 15 days of data, with memory consumption rising exponentially for each extra day I parse. – mtrolle Nov 06 '18 at 18:20
  • You haven't specified the Ruby version, the host operating system, or the resource being parsed. Additionally, you haven't provided your full code (unless you're downloading the images to memory and then discarding them without any further processing). So the only advice I can give is: don't use RestClient, which hasn't been updated in over a year. Use something more commonly used, like [httparty](https://github.com/jnunemaker/httparty). – anothermh Nov 07 '18 at 02:27
  • I don't think we've proven RestClient is the issue here but I agree, versions and platform would be helpful. – Anthony Nov 07 '18 at 14:22
  • Sorry - should have added version info; that's been added now. @anothermh I don't agree that I haven't provided full code. The linked script runs on its own and reproduces the memory issue. In reality I do save the downloaded image files and store additional metadata using ActiveRecord, but my testing shows that doesn't add much to the memory consumption, so this test script still makes me wonder why garbage collection isn't reclaiming the memory. – mtrolle Nov 07 '18 at 14:47
  • It’s not going to be possible to help without access to the resource being parsed. – anothermh Nov 08 '18 at 23:25
  • I've updated my gist with URLs so it's now possible to execute it: https://gist.github.com/mtrolle/96f55822122ecabd3cc46190a6dc18a5 I've also tried Ruby 2.5.1 with the same high memory consumption, and macOS with the same result. I've also created an HTTParty version here https://gist.github.com/mtrolle/dbd2cdf70f77a83b4178971aa79b6292 which actually shows higher memory consumption than RestClient in my tests. – mtrolle Nov 14 '18 at 14:02

1 Answer


I believe it's all about the HTTP client you are using: Rest-Client. Unfortunately, it has a reputation for being memory-hungry. You should switch to a gem that is both memory- and time-efficient.

I would highly recommend HTTP.rb, or HTTPX, its HTTP/2-capable successor.

For a good benchmark, have a look at this article by the author of the Shrine gem: https://twin.github.io/httprb-is-great/

Here is what I found after replacing Rest-Client with HTTP.rb on my local machine:

Versions: Ruby: 2.5.3p105, HTTP.rb: 4.0.0, OS: Ubuntu 16.04

Total download size: 96.92 MB through 118 unique requests.

Memory consumption:

Total allocated: 7107283 bytes (83437 objects)
Total retained:  44221 bytes (385 objects)

So it allocated only about 7 MB while downloading 96.92 MB, compared to roughly 1 GB using Rest-Client.
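A minimal HTTP.rb version of the image download, assuming the gem's standard streaming API (the URL and path are placeholders), might look like this; the small helper underneath shows the per-chunk principle that keeps memory flat:

```ruby
# require "http"  # gem "http" (HTTP.rb); shown commented so the sketch stays stdlib-only
#
# response = HTTP.get(image_url)
# File.open(path, "wb") do |f|
#   response.body.each { |chunk| f.write(chunk) }  # each chunk is GC-able once written
# end

# The per-chunk principle with a plain Ruby enumerable: only one chunk is live at a time.
def write_chunks(chunks, io)
  chunks.each { |chunk| io.write(chunk) }
  io
end
```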

Here is the snippet: https://gist.github.com/mtrolle/96f55822122ecabd3cc46190a6dc18a5#gistcomment-2774405
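As an aside, even Rest-Client can avoid buffering whole bodies in Ruby strings: its documented raw_response: true option streams the body to a Tempfile (the URL and target path below are placeholders). The stdlib IO.copy_stream call shown underneath demonstrates the same fixed-buffer principle on local files:

```ruby
require "tempfile"

# With Rest-Client (placeholder URL, shown as a comment):
# raw = RestClient::Request.execute(method: :get, url: image_url, raw_response: true)
# FileUtils.mv(raw.file.path, "images/asset.jpg")  # body went to disk, not a Ruby String

# The same fixed-buffer streaming with the stdlib, runnable on local files:
src = Tempfile.new("src")
src.write("A" * 1_000_000)  # pretend this is a downloaded image
src.rewind

dst = Tempfile.new("dst")
IO.copy_stream(src, dst)    # copies via a small internal buffer, never the whole file
```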

Wasif Hossain