How to fix slow Nokogiri parsing

Question

I have a Rake task in my Rails app which looks into a folder for an XML file, parses it, and saves it to a database. The code works OK, but I have about 2100 files totaling 1.5GB, and processing is very slow, about 400 files in 7 hours. There are approximately 600-650 contracts in each XML file, and each contract can have 0 to n attachments. I did not paste all values, but each contract has 25 values.

To speed-up the process I use Activerecord's Import gem, so I am building an array per file and when the whole file is parsed. I do a mass import to Postgres. Only if a record is found is it directly updated and/or a new attachment inserted, but this is like 1 out of 100000 records. This helps a little, instead of doing new record per contract, but now I see that the slow part is XML parsing. Can you please look if I am doing something wrong in my parsing?

When I tried to print the arrays I am building, the slow part was until it loaded/parsed whole file and starts printing array by array. Thats why I assume the probem with speed is in parsing as Nokogiri loads the whole XML before it starts.

require 'nokogiri'
require 'pp'
require "activerecord-import/base"

ActiveRecord::Import.require_adapter('postgresql')
namespace :loadcrz2 do
  desc "this task load contracts from crz xml files to DB"
  task contracts: :environment do
    actual_dir = File.dirname(__FILE__).to_s
    Dir.foreach(actual_dir+'/../../crzfiles') do |xmlfile|

        next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'

         page = Nokogiri::XML(open(actual_dir+"/../../crzfiles/"+xmlfile))
         puts xmlfile
         cons = page.xpath('//contracts/*')
         contractsarr = []
         @c =[]
         cons.each do |contract|
            name = contract.xpath("name").text
            crzid = contract.xpath("ID").text
            procname = contract.xpath("procname").text
            conname = contract.xpath("contractorname").text
            subject = contract.xpath("subject").text
            dateeff = contract.xpath("dateefficient").text
            valuecontract = contract.xpath("value").text

            attachments = contract.xpath('attachments/*')
            attacharray = []
            attachments.each do |attachment|
                attachid = attachment.xpath("ID").text
                attachname = attachment.xpath("name").text
                doc = attachment.xpath("document").text
                size = attachment.xpath("size").text

                arr = [attachid,attachname,doc,size]
                attacharray.push arr
            end
            @con = Crzcontract.find_by_crzid(crzid)
            if @con.nil?
                @c=Crzcontract.new(:crzname => name,:crzid => crzid,:crzprocname=>procname,:crzconname=>conname,:crzsubject=>subject,:dateeff=>dateeff,:valuecontract=>valuecontract)
            else
                @con.crzname = name
                @con.crzid = crzid
                @con.crzprocname=procname
                @con.crzconname=conname
                @con.crzsubject=subject
                @con.dateeff=dateeff
                @con.valuecontract=valuecontract
                @con.save!
            end
            attacharray.each do |attar|
            attachid=attar[0]
            attachname=attar[1]
            doc=attar[2]
            size=attar[3]

                @at = Crzattachment.find_by_attachid(attachid)
                if @at.nil?

                if @con.nil?
                    @c.crzattachments.build(:attachid=>attachid,:attachname=>attachname,:doc=>doc,:size=>size)
                else
                    @a=Crzattachment.new
                    @a.attachid = attachid
                    @a.attachname = attachname
                    @a.doc = doc
                    @a.size = size
                    @a.crzcontract_id=@con.id
                    @a.save!
                end
                end

         end
            if @c.present?
            contractsarr.push @c
            end
            #p @c

         end
         #p contractsarr

         puts "done"
         if contractsarr.present?
         Crzcontract.import contractsarr, recursive: true
         end
         FileUtils.mv(actual_dir+"/../../crzfiles/"+xmlfile, actual_dir+"/../../crzfiles/archive/"+xmlfile)


        end
  end
end

I don't think this is doing what you think this is doing. There are many problems with this code. for example `@c = []` then `@c= Crzcontract.new` then `if @c.present?` (do you mean the Array or the the Crzcontract instance? also the `@con` logic is saving immediately not a mass import. I am not sure nokogiri is your issue here but rather the logic and the number of logical database hits especially since your concept of "updating" may have no impact at all if all the information is the same. Best bet is to split up this logic into classes so you can better determine what is going on — engineersmnky, Dec 30 '16 at 21:50
the logic is to set @c as empty for each iteration, if contract is not found then build new array for activerecord-import and push this array into contractsarr which is at the and used for that mass import. When I look into DB, data are all ok. The @con logic is used to save immediately as I wrote, but it is only in 1 record per 100k so I assume this should not be a perf issue. this is just lame workaround as I had issues with upsert in activerecord-import gem. — Mi Ro, Dec 30 '16 at 21:56
You are overwriting the instance variable `@c` if `@con` is `nil`. — engineersmnky, Dec 30 '16 at 21:57
You can make things faster by using `ActiverRecord::Base.transaction do ...` , by opening a transaction you will avoid opening and closing a database connection which will save alot of time — amrdruid, Dec 30 '16 at 22:01
engineersmnky I got your point, that @c=[] is forgotten from old code when I needed to access @c outside loop so I just defined it as empty, but when I removed it works same. The logic is I am building @c if contract is not found, to prepare array of new contracts for import gem. — Mi Ro, Dec 30 '16 at 22:08
Driud, I am not sure if transaction will help because my import is pretty fast (1 file in a second) but until I build an array for that import, that takes forever — Mi Ro, Dec 30 '16 at 22:08
How big is your biggest file? If you don't import anything into the DB, don't create any big array, parse the xml but don't output anything, is it still slow? — Eric Duminil, Dec 30 '16 at 22:44
biggest file has 11MB, but most of them dont get over 7MB. I dont understand your point, I need to import each file into DB — Mi Ro, Dec 30 '16 at 22:48
The point of the suggestion was that to understand performance issues you need to establish which part of the process is expensive, and one good way of doing that is to omit some parts of the process and observe the impact on performance of doing so. — Michael Kay, Dec 30 '16 at 23:01
got it..ok Eric is right, when I just keep it parsing, it is fast so the problem looks with building arrays or looking into that DB for possible duplicate — Mi Ro, Dec 30 '16 at 23:10
found a problem, when I keep arrays, just disable looking for duplicates (looking for crzid or attachment id) it is pretty fast so I need to find a way how to speed up looking for those ids — Mi Ro, Dec 30 '16 at 23:15
Hmm seems like someone might have suggested logical database hits as the root cause. — engineersmnky, Dec 31 '16 at 01:32
yep, my stupidity, I didnt expect that read could be so slow so I was focusing on speeding up write to DB. When I added index on those fields I use in find_by it is pretty fast thanks for pointing me there — Mi Ro, Dec 31 '16 at 10:28

score 2 · Accepted Answer · edited May 23 '17 at 11:46

There are a number of problems with the code. Here are some ways to improve it:

actual_dir = File.dirname(__FILE__).to_s

Don't use to_s. dirname is already returning a string.

actual_dir+'/../../crzfiles', with and without a trailing path delimiter is used repeatedly. Don't make Ruby rebuild the concatenated string over and over. Instead define it once, but take advantage of Ruby's ability to build the full path:

File.absolute_path('../../bar', '/path/to/foo') # => "/path/bar"

So use:

actual_dir = File.absolute_path('../../crzfiles', __FILE__)

and then refer to actual_dir only:

Dir.foreach(actual_dir)

This is unwieldy:

next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'

I'd do:

next if (xmlfile[0] == '.' || xmlfile == 'archive')

or even:

next if xmlfile[/^(?:\.|archive)/]

Compare these:

'.hidden'[/^(?:\.|archive)/] # => "."
'.'[/^(?:\.|archive)/] # => "."
'..'[/^(?:\.|archive)/] # => "."
'archive'[/^(?:\.|archive)/] # => "archive"
'notarchive'[/^(?:\.|archive)/] # => nil
'foo.xml'[/^(?:\.|archive)/] # => nil

The pattern will return a truthy value if it starts with '.' or is equal to 'archive'. It's not as readable but it's compact. I'd recommend the compound conditional test though.

In some places, you're concatenating xmlfile, so again let Ruby do it once:

xml_filepath = File.join(actual_dir, xmlfile)

which will honor the file path delimiter for whatever OS you're running on. Then use xml_filepath instead of concatenating the name:

xml_filepath = File.join(actual_dir, xmlfile)))
page = Nokogiri::XML(open(xml_filepath))

[...]

FileUtils.mv(xml_filepath, File.join(actual_dir, "archive", xmlfile)

join is a good tool so take advantage of it. It's not just another name for concatenating strings, because it's also aware of the correct delimiter to use for the OS the code is running on.

You use a lot of instances of:

xpath("some_selector").text

Don't do that. xpath, along with css and search return a NodeSet, and text when used on a NodeSet can be evil in a way that'll hurtle you down a very steep and slippery slope. Consider this:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<root>
  <node>
    <data>foo</data>
  </node>
  <node>
    <data>bar</data>
  </node>
</root>
EOT

doc.search('//node/data').class # => Nokogiri::XML::NodeSet
doc.search('//node/data').text # => "foobar"

The concatenation of the text into 'foobar' can't be split easily and it's a problem we see here in questions too often.

Do this if you expect getting a NodeSet back because of using search, xpath or css:

doc.search('//node/data').map(&:text) # => ["foo", "bar"]

It's better to use at, at_xpath or at_css if you're after a specific node because then text will work as you'd expect.

See "How to avoid joining all text from Nodes when scraping" also.

There's a lot of replication that could be DRY'd. Instead of this:

name = contract.xpath("name").text
crzid = contract.xpath("ID").text
procname = contract.xpath("procname").text

You could do something like:

name, crzid, procname = [
  'name', 'ID', 'procname'
].map { |s| contract.at(s).text }

How to fix slow Nokogiri parsing

1 Answers1