
I want to parse a couple of thousand XML files from a website (I have permission) and have to use SAX to avoid loading each file into memory. Then I want to save the data into a CSV file.

The XML files look like this:

<?xml version="1.0" encoding="UTF-8"?><educationInfo xmlns="http://skolverket.se/education/info/1.2" xmlns:ct="http://skolverket.se/education/commontypes/1.2" xmlns:nya="http://vhs.se/NyA-emil-extensions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" expires="2013-08-01" id="info.uh.su.HIA80D" lastEdited="2011-10-13T10:10:05" xsi:schemaLocation="http://skolverket.se/education/info/1.2 educationinfo.xsd">
  <titles>
    <title xml:lang="sv">Arkivvetenskap</title>
    <title xml:lang="en">Archival science</title>
  </titles>
  <identifier>HIA80D</identifier>
  <educationLevelDetails>
    <typeOfLevel>uoh</typeOfLevel>
    <typeOfResponsibleBody>statlig</typeOfResponsibleBody>
    <academic>
      <course>
        <type>avancerad</type>
      </course>
    </academic>
  </educationLevelDetails>
  <credits>
    <exact>60</exact>
  </credits>
  <degrees>
    <degree>Ingen examen</degree>
  </degrees>
  <prerequisites>
    <academic>uh</academic>
  </prerequisites>
  <subjects>
    <subject>
      <code source="vhs">10.300</code>
    </subject>
  </subjects>
  <descriptions>
    <ct:description xml:lang="sv">
      <ct:text>Arkivvetenskap rör villkoren för befintliga arkiv och modern arkivbildning med fokus på arkivarieyrkets arbetsuppgifter: bevara, tillgängliggöra och styra information. Under ett år behandlas bl a informations- och dokumenthantering, arkivredovisning, gallring, lagstiftning och arkivteori. I kursen ingår praktik, där man under handledning får arbeta med olika arkivarieuppgifter.</ct:text>
    </ct:description>
  </descriptions>
</educationInfo> 

I'm using this code template; see my comments for questions:

class InfoData < Nokogiri::XML::SAX::Document

  def initialize
    # do one-time setup here, called as part of Class.new
    # But what should I use, hashes or arrays?
  end

  def start_element(name, attributes = [])
    # check the element name here and create an active record object if appropriate
    # How do I grab a specific element like ct:text?
    # How do I grab the root element?
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your ActiveRecord object now
  end

end

parser = Nokogiri::XML::SAX::Parser.new(InfoData.new)

# How do I parse every xml-link? 
parser.parse_file('')

I wrote this method to grab the links, but I don't know where to call it, or whether it belongs inside the class at all:

require 'open-uri'

@items = Set.new

def get_links(url)
  doc = Nokogiri::HTML(open(url))
  doc.xpath('//a/@href').each do |href|
    item = {}
    item[:url] = href.content
    @items << item
  end
end
matt
  • If that XML sample is a full XML file, I'd use the DOM, rather than SAX, because it's a bit easier. These days, most hosts have multiple gigabytes of RAM, making SAX less important. BIG XML files will be processed faster by SAX but your development time will probably take longer. – the Tin Man Mar 30 '12 at 16:07
  • @theTinMan I have tried to parse it using DOM and it won't work. It's about 46,000 XML files, so the proper way is to use SAX parsing. –  Mar 30 '12 at 16:57

2 Answers

0
require 'nokogiri'

class LinkGrabber < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    if name == 'a'
      puts Hash[attrs]['href']
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(LinkGrabber.new)
parser.parse(File.read(ARGV[0], mode: 'rb'))

Now you can use this in a pipeline:

find . -name "*.xml" -print0 | xargs -P 20 -0 -L 1 ruby parse.rb > links

But that starts up Ruby for every file. So you're better off using JRuby (which is faster anyway) and threach.

require 'threach'
require 'find'
require 'nokogiri'

class LinkGrabber < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    if name == 'a'
      puts Hash[attrs]['href']
    end
  end
end

# let's hope it's thread-safe
parser = Nokogiri::XML::SAX::Parser.new(LinkGrabber.new)
Find.find(ARGV[0]).threach do |path|
  next unless File.file?(path)
  parser.parse(File.read(path))
end
Reactormonk
  • Awesome, mate. What is a pipeline, and did you parse the XML document or only the links? –  Mar 30 '12 at 22:24
  • That part of `SAX` takes all elements with the name `'a'` and `puts` their `'href'` attribute. The pipeline contains some `xargs` and parallel process magic, I'd recommend using the jruby solution, because it's pure ruby. It's no use to run that with the normal ruby, because MRI doesn't support real threads. With jruby, you can use all of your cores. – Reactormonk Mar 30 '12 at 22:27
  • So is this parser only for grabbing the links, or is it for every XML file? Can you please tell me how you take elements like title, degree and subject and store them in a hash? –  Mar 30 '12 at 23:00
  • Give me an example input and an example output. – Reactormonk Mar 30 '12 at 23:28
  • If you check my old S.O. question here: http://stackoverflow.com/questions/9573997/how-to-crawl-the-right-way/9577309#9577309, you can see my input and output examples, but DOM parsing is not working for my project. –  Mar 30 '12 at 23:33
-1

Maybe this can work:

    require 'open-uri'

    def get_links(url)
      doc = Nokogiri::HTML(open(url))
      doc.xpath('//a/@href').each do |href|
        parser.parse_io(open(href))
      end
    end
pgon
  • `Nokogiri::HTML(open(url))` is wrong for an XML document. `Nokogiri::HTML` relaxes the parser to allow for HTML's notorious lack of standards. Instead, use `Nokogiri::XML()` to parse the XML with the strict parser. – the Tin Man Mar 30 '12 at 16:09
  • No `#xpath` in SAX. That's DOM. – Reactormonk Mar 30 '12 at 21:53