I'm trying to parse a large XML file with Nokogiri's SAX parser.

It works great when I read the data from a file, but memory usage climbs to over 1 GB when the same data is read from Redis.

Here's the most basic code I can use to replicate the issue.

Any ideas why it's doing this?

require 'nokogiri'

# Minimal SAX handler: the callback is intentionally empty.
class WordsList < Nokogiri::XML::SAX::Document

  def start_element name, attrs = []
  end

end

And here's how I'm loading it:

  doc              = WordsList.new
  parser           = Nokogiri::XML::SAX::Parser.new doc
  parser.parse row_data

The row_data method is what gets the XML from Redis.

Thanks.

99miles

1 Answer

What happens to your memory usage when you run this?

require 'nokogiri'

File.open('xml.xml', 'w') do |f|
  f.puts '<?xml version="1.0" encoding="UTF-8"?>'
  f.puts '<my_root>'

  xml = <<'END_OF_XML'
  <note>
  <to>Tove</to>
  <from gender="F" age="25" address="123 Maple St.">Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
  </note>

  <note>
  <to>Tove</to>
  <from gender="F" age="25" address="123 Apple St.">Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
  </note>

END_OF_XML

  f.puts xml * 500_000 
  f.puts '</my_root>'
end

class WordsList < Nokogiri::XML::SAX::Document

  attr_writer :sort_key
  attr_reader :obj

  def initialize
    @obj      = []
    @sort_key = :address
    @limit    = 10
  end

  def sort_key
    @sort_key.to_s
  end

  def start_element name, attrs = []
    add_to_list Hash[attrs] if name == 'from'
  end

  def add_to_list hash
    @obj.push hash
    @obj = sorted.first(@limit)
  end

  def sorted
    @obj.sort_by do |item|
      begin
        Float(item[sort_key].gsub(",", ""))
      rescue ArgumentError
        item[sort_key].downcase
      end
    end.reverse
  end

end

my_handler = WordsList.new

parser = Nokogiri::XML::SAX::Parser.new(my_handler)
# The block form closes the file handle once parsing finishes.
File.open('xml.xml') { |f| parser.parse(f) }

7stud
  • That has no problems. I narrowed it down, and even with an essentially empty SAX document, the memory jumps way up. I'll update the post. – 99miles May 20 '14 at 16:27
  • `The row_data method is what gets the XML from Redis.` Then if you are not already doing so, start looking into what that method is doing. – 7stud May 20 '14 at 19:19
  • Please post the entirety of the code for row_data() including any require statements that are necessary for the code to work. – 7stud May 21 '14 at 17:15
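
One way to start looking into row_data, as a rough sketch only: check what kind of object it hands to the parser and how large it is. The file-based test above passes an IO object (a File), whereas a value fetched from Redis is presumably a single String held entirely in memory; that is an assumption, since the body of row_data was never posted. The describe_payload helper and the sample value below are hypothetical and exist only for illustration.

```ruby
require 'stringio'

# Hypothetical helper (not from the original post): summarize whatever
# object is about to be passed to parser.parse.
def describe_payload(data)
  {
    klass:   data.class.name,                        # e.g. "String" or "File"
    io_like: data.respond_to?(:read),                # true for File, StringIO, ...
    bytes:   data.is_a?(String) ? data.bytesize : nil
  }
end

# Stand-in for row_data; the real method reads the XML from Redis.
sample = '<my_root><note/></my_root>'

p describe_payload(sample)                 # a plain String
p describe_payload(StringIO.new(sample))   # the same data wrapped as an IO-like object
```

Calling describe_payload(row_data) just before parser.parse would show whether the parser receives one very large String or something IO-like, which narrows down where the extra memory might be coming from.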