
I want to write approximately 50MB of data to an XML file.

I found Nokogiri (1.5.0) to be efficient for parsing, where I only read the XML. It is not a good option for writing an XML file, though, since it holds the complete document in memory until it finally writes it out.

I found Builder (3.0.0) to be a good option but I'm not sure if it's the best option.

I tried some benchmarks with the following simple code:

  (1..500000).each do |k|
    xml.products {
      xml.widget {
        xml.id_ k
        xml.name "Awesome widget"
      }
    }
  end

Nokogiri took about 143 seconds, and memory consumption grew gradually, ending at about 700 MB.

Builder took about 123 seconds, and memory consumption stayed stable at about 10 MB.

So is there a better solution to write huge XML files (50 MB) in Ruby?

Here's the code using Nokogiri:

require 'rubygems'
require 'nokogiri'

start = Time.now
# The builder assembles the whole document in memory before anything is written.
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root {
    (1..500000).each do |k|
      xml.products {
        xml.widget {
          xml.id_ k
          xml.name "Awesome widget"
        }
      }
    end
  }
end
# The block form closes the file even if writing raises.
File.open("test_noko.xml", "w") { |f| f.write(builder.to_xml) }
puts((Time.now - start).to_s)

Here's the code using Builder:

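require 'rubygems'
require 'builder'

start = Time.now
File.open("test.xml", "w") do |f|
  # Builder appends each tag to the target as it is generated.
  # Note: unlike the Nokogiri version, no single root element is written here.
  xml = Builder::XmlMarkup.new(:target => f, :indent => 1)

  (1..500000).each do |k|
    xml.products {
      xml.widget {
        xml.id_ k
        xml.name "Awesome widget"
      }
    }
  end
end
puts((Time.now - start).to_s)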
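
For repeatable timings, the Time.now arithmetic above could be swapped for the standard library's Benchmark module. This is a minimal sketch, assuming a hypothetical helper named time_xml_write and an in-memory StringIO target in place of a real file; it measures elapsed time and bytes written, not memory use:

```ruby
require 'benchmark'
require 'stringio'

# Hypothetical helper (not part of Builder or Nokogiri): yields the IO to a
# block that writes XML and returns [elapsed_seconds, bytes_written].
def time_xml_write(io)
  elapsed = Benchmark.realtime { yield io }
  [elapsed, io.pos]
end

# Small run against an in-memory target; pass a File to time real disk writes.
out = StringIO.new
elapsed, bytes = time_xml_write(out) do |io|
  io.puts '<root>'
  1000.times do |k|
    io.puts "<products><widget><id>#{k}</id><name>Awesome widget</name></widget></products>"
  end
  io.puts '</root>'
end
puts format('%.3f s, %d bytes', elapsed, bytes)
```

The same helper can wrap either the Builder or the Nokogiri loop, so both are timed the same way.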
the Tin Man
Gaurav Shah
  • Re parsing: Nokogiri is pretty user-friendly, but when speed is key, I just write a SAX parser (available in Nokogiri as well). I have a handy utility class (https://gist.github.com/854726) that I use to build, blazingly fast, an array of the stuff I need from an XML file, provided the XML is pretty simple. Otherwise I might have to write a custom SAX parser. – sunkencity Sep 19 '11 at 06:29
  • You took it the other way. I want to build XML from an array (ActiveRecord). – Gaurav Shah Sep 19 '11 at 06:42
  • It was a comment on "I found nokogiri (1.5.0) gem to be the most efficient to parse", my point being that the most efficient way to parse is to use the SAX parser API directly. – sunkencity Sep 19 '11 at 06:56

1 Answer


Solution 1

If speed is your main concern, I'd just use libxml-ruby directly:

$ time ruby test.rb 

real    0m7.352s
user    0m5.867s
sys     0m0.921s

The API is pretty straightforward:

require 'rubygems'
require 'xml'

doc = XML::Document.new
doc.root = XML::Node.new('root_node')
root = doc.root

500000.times do |k|
  # Nest <widget> inside <products>, and <products> inside the root node.
  products = XML::Node.new('products')
  widget = XML::Node.new('widget')
  widget['id'] = k.to_s
  widget['name'] = 'Awesome widget'
  products << widget
  root << products
end

# The whole tree is held in memory until this call writes it out.
doc.save('foo.xml', :indent => false, :encoding => XML::Encoding::UTF_8)

Using :indent => true doesn't make much difference in this case, but for more complex XML files it might.

$ time ruby test.rb #(with indent)

real    0m7.395s
user    0m6.050s
sys     0m0.847s

Solution 2

Of course, the fastest solution, and one that doesn't build up memory, is to write the XML by hand. The drawback is that it opens up other sources of error, such as invalid XML:

$ time ruby test.rb 

real    0m1.131s
user    0m0.873s
sys     0m0.126s

Here's the code:

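# The block form closes the file even if a write raises.
File.open("foo.xml", "w") do |f|
  f.puts('<doc>')
  500000.times do |k|
    # Values are interpolated as-is; nothing here escapes &, < or ".
    f.puts "<product><widget id=\"#{k}\" name=\"Awesome widget\" /></product>"
  end
  f.puts('</doc>')
end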
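
The invalid-XML risk mentioned above comes mostly from unescaped values. Here is a minimal sketch of the same hand-written approach with escaping added, assuming a hypothetical helper named write_products and rows given as [id, name] pairs. It relies on Ruby's built-in String#encode with the xml: :attr option, which escapes &, <, > and " and wraps the result in double quotes:

```ruby
require 'stringio'

# Hypothetical helper: streams one <product> line per [id, name] pair to the
# given IO. encode(xml: :attr) returns the value escaped and already wrapped
# in double quotes, so no quotes are written around the interpolation.
def write_products(io, rows)
  io.puts '<doc>'
  rows.each do |id, name|
    io.puts "<product><widget id=#{id.to_s.encode(xml: :attr)} " \
            "name=#{name.to_s.encode(xml: :attr)} /></product>"
  end
  io.puts '</doc>'
end

# In-memory demo; pass a File opened for writing to stream to disk instead.
out = StringIO.new
write_products(out, [[1, 'Awesome widget'], [2, 'Nuts & "bolts" <large>']])
puts out.string
```

Each line is written as soon as it is built, so memory use should stay flat as with the plain version. This only guards attribute values; it does not validate element names or overall structure.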
the Tin Man
sunkencity
  • But with this, memory goes up to 600 MB. That is way too much, isn't it? – Gaurav Shah Sep 19 '11 at 06:40
  • I added a way to do it without eating up the memory. It's faster, but you don't get any of the benefits of using an XML generator, like automatic indentation and the checks for validity. – sunkencity Sep 19 '11 at 06:52
  • In the case of solution 2, why not use Builder itself? It would provide validation and also be safer, wouldn't it? – Gaurav Shah Sep 19 '11 at 06:59
  • Because if you want to make it go faster you need to make it do less. There's lots of performance overhead to using nested blocks like builder does, and all kinds of magic. In the case of solution 2 there's very little code to run, therefore it's 100x faster. – sunkencity Sep 19 '11 at 07:01
  • Another way to do XML is just to use an ERB template, products.xml.erb, and loop in there. – sunkencity Sep 19 '11 at 07:02
  • My XML structure varies a lot depending on the request, so ERB doesn't seem to be a good option, but yup, you are right. – Gaurav Shah Sep 19 '11 at 07:05
  • Change 500000 to 5000000 and libxml crashes, even when the system has 6 GB of RAM. Just FYI. – Gaurav Shah Sep 19 '11 at 07:22
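
The ERB suggestion from the comments can be sketched with the standard library alone. This is an illustration under stated assumptions, not the commenter's code: the template text, the PRODUCTS_TEMPLATE constant and the render_products helper are all hypothetical, and the template is inlined here rather than read from a products.xml.erb file. Note that ERB returns the rendered result as one String, so the whole output is held in memory before it can be written.

```ruby
require 'erb'

# Hypothetical template; in an application this text would live in
# products.xml.erb. ERB::Util.html_escape escapes &, <, > and quotes.
PRODUCTS_TEMPLATE = <<~TEMPLATE
  <doc>
  <% products.each do |prod| -%>
  <product><widget id="<%= ERB::Util.html_escape(prod[:id]) %>" name="<%= ERB::Util.html_escape(prod[:name]) %>" /></product>
  <% end -%>
  </doc>
TEMPLATE

# Hypothetical helper: renders the template with `products` bound as a local.
# trim_mode '-' drops the newline after tags that end in -%>.
def render_products(products)
  ERB.new(PRODUCTS_TEMPLATE, trim_mode: '-').result_with_hash(products: products)
end

puts render_products([{ id: 1, name: 'Awesome widget' },
                      { id: 2, name: 'Nuts & bolts' }])
```

For output that varies per request, the template could branch on the data, but every variation still has to be expressed in the template itself.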