
I have a Ruby script that collects 46,344 XML links and then extracts 16 element nodes from every XML file. The last part of the process stores them in a CSV file. The problem is that it takes too long: more than 1-2 hours.

Below is the script without the link that contains all the XML links. I can't provide the link because it's company stuff; I hope that's cool.

Here is the script; it works, but it takes too long:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML

@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
@typeOfResponsibleBody = Array.new
@courseTyp = Array.new
@credits = Array.new
@degree = Array.new
@preAcademic = Array.new
@subjectCodeVhs = Array.new
@descriptionSv = Array.new
@visibleToSweApplicants = Array.new
@lastedited = Array.new
@expires = Array.new

# Fetch the page that lists all the XML links
htmldoc = Nokogiri::HTML(open('A SITE THAT HAVE ALL THE LINKS'))
# Fetch the links to the XML files and store them in the urls array
htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end

@urls.each do |url|
  # Loop through the XML files and grab the element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/ns:educationInfo/ns:titles/ns:title[1]"){
    |e| @titleSv << e.text
  }
  # TitleEn
  xmldoc.elements.each("/ns:educationInfo/ns:titles/ns:title[2]"){
    |e| @titleEn << e.text
  }
  # Identifier
  xmldoc.elements.each("/ns:educationInfo/ns:identifier"){
    |e| @identifier << e.text
  }
  # typeOfLevel
  xmldoc.elements.each("/ns:educationInfo/ns:educationLevelDetails/ns:typeOfLevel"){
    |e| @typeOfLevel << e.text
  }
  # typeOfResponsibleBody
  xmldoc.elements.each("/ns:educationInfo/ns:educationLevelDetails/ns:typeOfResponsibleBody"){
     |e| @typeOfResponsibleBody << e.text
  }
  # courseTyp
  xmldoc.elements.each("/ns:educationInfo/ns:educationLevelDetails/ns:academic/ns:courseOfferingPackage/ns:type"){
     |e| @courseTyp << e.text
  }
  # credits
  xmldoc.elements.each("/ns:educationInfo/ns:credits/ns:exact"){
     |e| @credits << e.text
  }
  # degree
  xmldoc.elements.each("/ns:educationInfo/ns:degrees/ns:degree"){
     |e| @degree << e.text
  }
  # @preAcademic
  xmldoc.elements.each("/ns:educationInfo/ns:prerequisites/ns:academic"){
    |e| @preAcademic << e.text
  }
  # @subjectCodeVhs
  xmldoc.elements.each("/ns:educationInfo/ns:subjects/ns:subject/ns:code"){
    |e| @subjectCodeVhs << e.text
  }
  # DescriptionSv
  xmldoc.elements.each("/educationInfo/descriptions/ct:description/ct:text"){
    |e| @descriptionSv << e.text
  }
  # Fetch the document's expiry date
  @expires << root.attributes["expires"]
  # Fetch the document's lastEdited date
  @lastedited << root.attributes["lastEdited"]
end

# Store everything in the CSV file. This is done once, after the loop:
# opening the file in "wb" mode inside the loop rewrote the entire CSV
# on every single URL, which gets slower and slower as the arrays grow.
CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end
  • I'd start by doing some profiling (ruby-prof, etc.) to see where the bottleneck is: is it the XML parsing, network bandwidth when downloading the files to parse, etc. – Frederick Cheung Feb 25 '12 at 23:20
  • Sweet, I will check that out. –  Feb 25 '12 at 23:23
  • I second @FrederickCheung's comment. I'll add that it probably has more to do with network access than with the actual parsing. – s.m. Feb 25 '12 at 23:33
  • Agreed that it is likely network latency. But why are you using Nokogiri *and* REXML? If it was all Nokogiri it would be faster. – Mark Thomas Feb 28 '12 at 02:28
  • Because Nokogiri can't for some reason parse the root element node, and REXML only parses XML files, or maybe it is just my code. @MarkThomas do you know threads in Ruby? I can't get my head around it. –  Feb 28 '12 at 22:11
  • @SHUMAcupcake Yes, but you should either edit this question or ask another one regarding a threaded approach. – Mark Thomas Feb 29 '12 at 02:41
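A rough version of the profiling advice above can be done without extra gems by timing the download phase and the parse phase separately with Ruby's standard Benchmark module (ruby-prof gives a much finer per-method breakdown). The inline payloads here are hypothetical stand-ins for the real downloads, since the actual URLs aren't available:

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical stand-ins for the downloaded XML files.
payloads = [
  '<educationInfo id="1"><identifier>a</identifier></educationInfo>',
  '<educationInfo id="2"><identifier>b</identifier></educationInfo>'
]

download_time = 0.0
parse_time    = 0.0
docs = []

payloads.each do |xml|
  data = nil
  # In the real script this block would be `data = open(url).read`.
  download_time += Benchmark.realtime { data = xml.dup }
  # Time the REXML parsing separately from the fetch.
  parse_time    += Benchmark.realtime { docs << REXML::Document.new(data) }
end

puts format('download: %.4fs  parse: %.4fs', download_time, parse_time)
```

If the download total dominates, threading or batching the fetches is where the effort should go; if parsing dominates, switching the per-document parsing to Nokogiri is the better bet.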

1 Answer


If it's network access, you could start threading it and/or switch to JRuby, which can use all the cores of your processor. If you have to do this often, you will have to work out a read/write strategy that serves you best without blocking.
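A minimal sketch of the threaded approach, using only Ruby's core Thread and Queue classes. The `fetch_and_parse` method and the URL list are hypothetical placeholders; in the real script each worker would do the `open(url).read` and the XML parsing, and the CSV would still be written once at the end from the main thread so the file is never written from several threads at once:

```ruby
# Hypothetical placeholder for downloading and parsing one XML file.
def fetch_and_parse(url)
  "parsed:#{url}"
end

urls    = (1..20).map { |i| "http://example.com/doc#{i}.xml" }  # stand-in list
queue   = Queue.new
results = Queue.new
urls.each { |u| queue << u }

# A small fixed pool of workers; each pops URLs until the queue is drained.
workers = 5.times.map do
  Thread.new do
    until queue.empty?
      # Non-blocking pop: raises ThreadError when another worker got there
      # first, in which case this worker simply stops.
      url = queue.pop(true) rescue break
      results << fetch_and_parse(url)
    end
  end
end
workers.each(&:join)

# Drain the thread-safe results queue in the main thread before writing.
rows = []
rows << results.pop until results.empty?
puts "processed #{rows.length} of #{urls.length} urls"
```

A Queue is used in both directions because it is thread-safe, unlike a plain Array; the pool size (5 here) is a tuning knob, since too many concurrent downloads can saturate the connection or the remote server.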

three
  • I am checking that out too, dude. Can you provide some example with my code? =) Cheers –  Feb 26 '12 at 20:14
  • No, I don't have any current code examples, sorry. I'd iterate through the URLs in steps of ten or so and let each batch work in a single thread, and also let it write. But hey, try to find some docs on threads. I really have no idea how I pulled it off. – three Feb 26 '12 at 21:51