
I have been working and tinkering with Nokogiri, REXML and Ruby for a month. I have this giant database that I am trying to crawl, and the things I am scraping are HTML links and XML files.

There are exactly 43612 XML files that I want to crawl and store in a CSV file.

My script works if I crawl maybe 500 XML files, but anything larger takes too much time and it freezes or something.

I have divided the code into pieces here so it's easier to read; the whole script/code is here: https://gist.github.com/1981074

I am using two libraries because I couldn't find a way to do this all in Nokogiri. I personally find REXML easier to use.

My question: how can I fix this so it won't take a week to crawl everything? How do I make it run faster?

HERE IS MY SCRIPT:

Require the necessary libraries:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML

Create a bunch of arrays to store the grabbed data:

@urls = Array.new 
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new

Grab all the XML links from a specific site and store them in an array called @urls:

htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))

htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end

Loop through the @urls array and grab every element node that I want with XPath.

@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Grab the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
    m = e.text.to_s
    next if m.empty?
    @titleSv << m
  end
  # ... (the other element nodes are grabbed the same way; see the gist for the full loop)
end

Then store them in a CSV file.

CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end
David J.
  • your code is pretty strange. You can try to use [pioneer gem](https://github.com/fl00r/pioneer) to make your own asynchronous crawler. Anyway you should redesign your code. It shouldn't work like this – fl00r Mar 05 '12 at 21:31
  • @fl00r any idea how I should redesign my code? –  Mar 05 '12 at 22:10
  • Cool gem, mate. I only wish there were some more examples and a nice tutorial. –  Mar 05 '12 at 22:15
  • I don't even understand what is going on here. Does each URL contain only one id, titleSv, titleEn, identifier, etc., or does it contain one id and many other properties? The idea is to move all this CSV logic into the `each` iterator, so you don't first save all the data into arrays but write it straight to CSV. – fl00r Mar 05 '12 at 22:23
  • Each URL contains an XML file that has 15 element nodes. OK, I have to find out how to do that with CSV. Can't you show me how to do this with your gem? –  Mar 05 '12 at 22:37
  • As a comment about the practice of grabbing 43612 XML files; Unless the company wants you to grab things that way, going after that many files is likely to get you banned. You should see about a bulk feed instead. It'd be a lot faster to pull in a big tarball or zip and uncompress it on your side. – the Tin Man Mar 05 '12 at 23:29
  • Sorry for the late reply, @theTinMan. Yeah, I have permission so it's cool. –  Mar 09 '12 at 22:04

3 Answers


It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.

Libraries

You're using a lot of libraries here that probably aren't necessary.

You use both REXML and Nokogiri. They both do the same job. Except Nokogiri is much better at it (benchmark).
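For instance, a rough sketch of the same lookup done with Nokogiri alone, assuming the layout implied by your REXML XPath (the `local-name()` predicate is used here to sidestep the optional `ns:` prefix without registering namespaces):

require 'nokogiri'
require 'open-uri'

xml = Nokogiri::XML(open(url))

# Root id, same as root.attributes["id"] in REXML
id = xml.root['id']

# local-name() matches both <title> and <ns:title>
node = xml.at_xpath('//*[local-name()="titles"]/*[local-name()="title"][1]')
title_sv = node ? node.text : ''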

Use Hashes

Instead of storing data at index in 15 arrays, have one set of hashes.

For instance,

require 'set'

items = Set.new

# doc is the Nokogiri::HTML document of the link-listing page
doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end

items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))

  item[:id] = xml.root['id']
  ...
end

Collect the data, then write to file

Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
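A sketch of that final pass might look like this (the filename matches the original script, and the hash keys are the ones built up in the items set above; only a few of the 15 fields are shown):

require 'csv'

# One pass over the collected items, one CSV row per item
CSV.open("eduction_normal.csv", "wb") do |csv|
  items.each do |item|
    csv << [item[:id], item[:title_sv], item[:title_en], item[:url]]
    # ... add the rest of the fields in the same way
  end
end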

Be DRY

In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.

xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
  m = e.text.to_s
  next if m.empty?
  @titleSv << m
end

Move what's common to a method

def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    str = e.text.to_s
    # Stop at the first non-empty match
    break unless str.empty?
  end

  str
end

And move anything constant to another hash

xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}

Now you can combine these techniques to make for much cleaner code:

item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])

I hope this helps!

Ian Bishop
  • @lan Bishop, can you provide the hole concept on gist.github.. I can make the hole thing work –  Mar 20 '12 at 19:16
  • I was confused and intrigued by a "hole concept" until I realized "hole" really means "whole" – David J. Jun 21 '12 at 15:30
  • @DavidJames: I solved my problem by building a script that downloaded every file and then building a Nokogiri SAX parser, and it did the job very fast! So don't use REXML, use Nokogiri if you are doing something like this. Cheers! –  Jun 21 '12 at 16:08

It won't work without your own fixes, and I believe you should do as @Ian Bishop said and refactor your parsing code:

require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'

class Links < Pioneer::Base
  include REXML
  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end

  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end

class Crawler < Pioneer::Base
  include REXML
  def locations
    Links.new.start.flatten
  end

  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end

Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)
fl00r
  • @SHUMAcupcake: there is [good documentation about pioneer on GitHub](https://github.com/fl00r/pioneer) thanks to fl00r. – David J. Jun 21 '12 at 15:38

If you really want to speed it up, you're going to have to go concurrent.

One of the simplest ways is to install JRuby and then run your application with one small modification: install either the 'peach' or 'pmap' gems and then change your items.each to items.peach(n) (parallel each), where n is the number of threads. You'll need at least one thread per CPU core, but if you put I/O in your loop then you'll want more.
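A sketch of that change, reusing the items set from Ian Bishop's answer and the items.peach(n) form described above (treat the thread count as a starting point to tune):

require 'peach'  # gem install peach; run under JRuby for real thread-level parallelism

THREADS = 8  # at least one per core; more if the loop spends its time on I/O

items.to_a.peach(THREADS) do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
  # ... extract the remaining fields exactly as in the sequential version
end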

Also, use Nokogiri, it's much faster. Ask a separate Nokogiri question if you need to solve something specific with Nokogiri. I'm sure it can do what you need.
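For what it's worth, the asker's follow-up in the comments above says the final solution was to download every file first and then run a Nokogiri SAX parser over them. A rough sketch of that style (the class and file names are made up for illustration; the element names follow the XPath in the question):

require 'nokogiri'

# Streams one downloaded XML file and keeps the root id plus the first <title> text
class EducationHandler < Nokogiri::XML::SAX::Document
  attr_reader :id, :title

  def start_element(name, attrs = [])
    local = name.split(':').last
    @id ||= Hash[attrs]['id'] if local == 'educationInfo'
    @in_title = true if local == 'title' && @title.nil?
  end

  def characters(string)
    (@title = (@title || '') + string) if @in_title
  end

  def end_element(name)
    @in_title = false if name.split(':').last == 'title'
  end
end

handler = EducationHandler.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.read('downloaded/education_1.xml'))
puts handler.id, handler.title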

Mark Thomas