
I am using Rails 3.1.1 deployed on Heroku, with open-uri and Nokogiri.

I am trying to troubleshoot a memory leak (?) that occurs while fetching and parsing an XML feed. The feed I am fetching is 32 MB.

I am using the following code for it:

require 'open-uri'
open_uri_fetched = open(feed.fetch_url)     # fetch the remote feed
xml_list = Nokogiri::HTML(open_uri_fetched) # parse it into a document tree

where feed.fetch_url is the URL of an external XML file.

It seems that while Nokogiri is parsing the file (the last line in my code), memory usage explodes to 540 MB and continues to increase. That doesn't seem logical, since the XML file is only 32 MB.

I have looked all over for ways to analyze this better (e.g. Ruby/Rails memory leak detection tools) but I can't understand how to use any of them. MemoryLogic seems simple enough, but its installation instructions seem to lack some info...
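
For what it's worth, the crudest instrumentation I can think of would be to print the process RSS around each step (a rough sketch, assuming a POSIX ps is available on the dyno):

require 'open-uri'
require 'nokogiri'

# Resident set size of the current process in MB, read via the ps CLI.
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
end

puts "before fetch: #{rss_mb.round(1)} MB"
open_uri_fetched = open(feed.fetch_url)
puts "after fetch:  #{rss_mb.round(1)} MB"
xml_list = Nokogiri::HTML(open_uri_fetched)
puts "after parse:  #{rss_mb.round(1)} MB"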

So, please help me either determine whether the code above really should use that much memory, or give me (super simple) instructions on how to find the memory leak.

Thanks in advance!

Christoffer

2 Answers


Parsing a large XML file and turning it into a document tree will in general create an in-memory representation that is far larger than the XML data itself. Consider for example

<foo attr="b" />

which is only 16 bytes long (assuming a single-byte character encoding). The in-memory representation of this document will include an object representing the element itself, probably an (empty) collection of children, and a collection of attributes for that element containing at least one entry. The element itself has properties like its name and namespace, pointers to its parent document, and so on. The data structure for each of those things will probably be over 16 bytes on its own, even before they're wrapped in Ruby objects by Nokogiri (each of which almost certainly has a memory footprint >= 16 bytes).
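
To make that concrete, here is roughly what Nokogiri hands back for that one 16-byte element (a sketch; the inspected output is abbreviated):

require 'nokogiri'

doc  = Nokogiri::XML('<foo attr="b" />') # a Document object
root = doc.root                          # an Element object

root.name            # => "foo" (a String)
root.attributes      # => {"attr" => Attr node} -- a Hash plus an Attr object
root.children        # => an (empty) NodeSet, still an allocated object
root.document == doc # => true; every node keeps a pointer back to its document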

If you're parsing large XML files you almost certainly want to use an event-driven parser, such as a SAX parser, that responds to elements as they are encountered in the document, rather than building a tree representation of the entire document and then working on that.
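
Something along these lines with Nokogiri's built-in SAX support, as a rough sketch (the "product" element name is an assumption about your feed, and the handler body is a placeholder):

require 'nokogiri'
require 'open-uri'

class ProductHandler < Nokogiri::XML::SAX::Document
  # Called for each opening tag; attrs is an array of [name, value] pairs.
  def start_element(name, attrs = [])
    @current = Hash[attrs] if name == 'product'
  end

  # Called for each closing tag; handle one product at a time so it can be
  # garbage collected, instead of holding the whole tree in memory.
  def end_element(name)
    return unless name == 'product' && @current
    # ... do whatever you need with @current here ...
    @current = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(ProductHandler.new)
parser.parse(open(feed.fetch_url))

Peak memory use then scales with the size of a single product rather than the whole 32 MB document.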

Frederick Cheung
  • Is that pretty much what I am doing with the code shown? Heroku mentioned that I should not build a tree representation, but I am not sure whether that is what I am actually doing. I load the XML file and go through each "product" element. – Christoffer May 15 '12 at 14:31
  • You are building a tree representation (that's what Nokogiri::HTML does) and then walking that representation. – Frederick Cheung May 15 '12 at 15:29
  • This is a good answer. For such a big file you really have no better option, unless you want to use some C app for it. – Ismael Abreu May 16 '12 at 00:04
  • It took a while before I could try this out, but it works better with this solution. – Christoffer Oct 02 '12 at 08:36

Are you sure you aren't running up against the upper limits of what Heroku allows for 'long-running tasks'?

I've timed out and had stuff just fail on me all the time due to some of the restrictions Heroku puts on the free tier.

I mean, can you replicate this in your dev environment? How long does it take on your machine to do what you want?

EDIT 1:

What is this, by the way?

open_uri_fetched = open(feed.fetch_url)

Where is the URL it is fetching? Does it bork there or on the actual Nokogiri call? How long does the fetch take, anyway?
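
A crude way to time the two steps separately (a sketch, reusing feed.fetch_url from the question):

require 'benchmark'
require 'open-uri'
require 'nokogiri'

fetched = nil
doc     = nil
fetch_s = Benchmark.realtime { fetched = open(feed.fetch_url) }
parse_s = Benchmark.realtime { doc = Nokogiri::HTML(fetched) }
puts "fetch: #{fetch_s.round(2)}s, parse: #{parse_s.round(2)}s"

If the fetch alone takes ages, the problem is the download, not Nokogiri.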

pjammer
  • It could be something like that, since the script (if run freely) would take hours. However, when I get this error it is after less than a minute, while loading the feed. I do not get the same problem in my dev environment, but I don't think I have the same error codes or limits there. – Christoffer May 15 '12 at 12:05
  • Re your edit: The code presented is not the full code but rather the relevant part. feed.fetch_url is a string e.g. "http://domain.com/feed.xml" – Christoffer May 15 '12 at 14:44