ROR/Hpricot: parsing a site and searching/comparing strings with regex

Question

I just started with Ruby on Rails and want to create a simple website crawler that:

Goes through all the Sherdog fighters' profiles.
Gets the Referees' names.
Compares names with the old ones (both during the site parsing and from the file).
Prints and saves all the unique names to the file.

An example URL is: http://www.sherdog.com/fighter/Fedor-Emelianenko-1500

I am searching for tag entries like <span class="sub_line">Dan Miragliotta</span>. Unfortunately, in addition to the referee names I need, the same class is also used for:

The date.
"N/A" when the referee name is not known.

I need to discard all results that are the "N/A" string, as well as any string that contains digits. I managed the first part but couldn't figure out the second. After searching, experimenting, and rewriting, I managed to break the whole program and don't know how to fix it properly:

require 'rubygems'
require 'hpricot'
require 'simplecrawler'

# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]

# The crawler yields a Document object for each visited page.
sc.crawl { |document|
# Parse page title with Hpricot and print it
hdoc = Hpricot(document.data)

(hdoc/"td/span[@class='sub_line']").each do |span|
  if span.inner_html == 'N/A' || Regexp.new(".*/\d\.*$").match(span.inner_html)
    # puts "Test"
  else
    puts span.inner_html
    #File.open("File_name.txt", 'a') {|f| f.puts(hdoc.span.inner_html) } 
  end
end
}

I would also appreciate help with ideas on the rest of the program: How do I properly read the current names from the file, if the program is run multiple times, and how do I make the comparisons for the unique names?

Edit:

After some proposed improvements, here is what I got:

require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
#require 'open-uri'

sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1

sc.crawl { |document|
doc = Nokogiri::HTML(document.data)
names = doc.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
puts names
}

Unfortunately, the code still doesn't work - it returns a blank.

If I write doc = Nokogiri::HTML(open(document.data)) instead of doc = Nokogiri::HTML(document.data), it gives me the whole page, but the parsing still doesn't work.

In your edit, `doc.css('.sub-line')` fails to find anything. The only classes found in the `td` tags are: `["col_one", "col_two", "col_three", "col_four", "col_five", "col_six"]`. — the Tin Man, Oct 11 '12 at 03:33
I don't really understand what to do. :( ``doc.css('.col_four').map(&:content).uniq.reject { |c| c == 'N/A' }`` doesn't work. — Mikko Vedru, Oct 11 '12 at 03:45

Nick Colgan · Answer 1 · 2012-10-11T02:48:28.403

2

hpricot isn't maintained anymore. How about using nokogiri instead?

names = document.css('td:nth-child(4) .sub_line').map(&:content).uniq.reject { |c| c == 'N/A' }
=> ["Yuji Shimada", "Herb Dean", "Dan Miragliotta", "John McCarthy"]

A breakdown of the different parts:

document.css('td:nth-child(4) .sub_line')

This returns a Nokogiri::XML::NodeSet (an array-like collection) of the HTML elements with the class name sub_line that are in the fourth table column. Here document is the page parsed with Nokogiri::HTML.

.map(&:content)

For each element in the previous collection, return element.content (the element's text). This is equivalent to map { |element| element.content }.

.uniq

Remove duplicate values from the array.

.reject { |c| c == 'N/A' }

Remove elements whose value is "N/A"
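The map/uniq/reject part of the chain can be tried without a parser. The following is a minimal sketch, assuming a plain array of strings stands in for the text of the matched elements; the method name referee_names and the sample values are made up for illustration.

```ruby
# cell_texts is a hypothetical stand-in for the text of each matched element.
def referee_names(cell_texts)
  cell_texts
    .map(&:strip)                # tidy each string (stands in for map(&:content))
    .uniq                        # drop duplicates, keeping first-seen order
    .reject { |c| c == 'N/A' }   # drop the "unknown referee" placeholder
end

sample = ['Herb Dean', 'N/A', 'Dan Miragliotta', 'Herb Dean', 'N/A']
p referee_names(sample)
# => ["Herb Dean", "Dan Miragliotta"]
```

Calling uniq before reject means "N/A" is compared only once, although either order gives the same result.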

edited Oct 11 '12 at 02:48

answered Oct 11 '12 at 02:40

Nick Colgan


Thanks for the quick answer and explanations! I didn't know that Hpricot isn't maintained. Your answer doesn't work, though. :( "_undefined method `css' for :SimpleCrawler::Document (NoMethodError)_". Here is the code: `require 'rubygems'; require 'simplecrawler'; require 'nokogiri'; sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500"); sc.maxcount = 1; sc.crawl { |document| names = document.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }; puts names }` What am I doing wrong? – Mikko Vedru Oct 11 '12 at 02:56
It doesn't work because you're trying to use an unmaintained parser (Hpricot) which doesn't know about the `css` method, instead of Nokogiri which implements the `css` method. – the Tin Man Oct 11 '12 at 03:22
Apparently SimpleCrawler is tightly coupled to hpricot and also isn't well maintained. Why not try [anemone](http://anemone.rubyforge.org/)? – Nick Colgan Oct 11 '12 at 04:10
@Nick Colgan. Thanks for the tip. I will look into anemone. – Mikko Vedru Oct 11 '12 at 11:59

pguardiario · Accepted Answer · 2012-10-11T04:20:35.143

0

You would use array math (-) to compare them:

get referees from the current page

current_referees = doc.search('td[4] .sub_line').map(&:inner_text).uniq - ['N/A']

read old referees from the file

old_referees = File.read('old_referees.txt').split("\n")

use Array#- to compare them

new_referees = current_referees - old_referees

write the new file

File.open('new_referees.txt','w'){|f| f << new_referees * "\n"}
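Put together, the steps above might look like the sketch below. It is only an illustration under a few assumptions: the helper name record_new_referees is hypothetical, a missing file is treated as an empty list, and new names are appended to the same file instead of being written to a second one. A temporary directory is used so the example does not touch real files.

```ruby
require 'tmpdir'

# Returns the names in `current` that are not yet stored in the file at `path`
# and appends them to that file. A missing file counts as an empty list.
def record_new_referees(path, current)
  old = File.exist?(path) ? File.read(path).split("\n") : []
  fresh = current.uniq - old   # Array#- keeps only names not seen before
  File.open(path, 'a') { |f| fresh.each { |name| f.puts(name) } } unless fresh.empty?
  fresh
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'referees.txt')   # hypothetical file name
  p record_new_referees(path, ['Herb Dean', 'Yuji Shimada'])
  # => ["Herb Dean", "Yuji Shimada"]
  p record_new_referees(path, ['Herb Dean', 'John McCarthy'])
  # => ["John McCarthy"]
end
```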

edited Oct 11 '12 at 04:20

answered Oct 11 '12 at 04:05

pguardiario


Thanks! Now it finally works! Not only the current problem is solved, but the task is finished. I salute you! :) – Mikko Vedru Oct 11 '12 at 04:28

score 0 · Answer 3 · answered Oct 11 '12 at 04:08

0

This will return all the names, ignoring dates and "N/A":

puts doc.css('td span.sub_line').map(&:content).reject{ |s| s['/'] }.uniq

It results in:

Yuji Shimada
Herb Dean
Dan Miragliotta
John McCarthy
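The question also asked how to discard any string that contains a digit. A minimal sketch of that filter follows; the method name names_only and the sample strings (including the date format) are assumptions for illustration. Note that inside a double-quoted Ruby string "\d" is just the letter d, so a regex literal such as /\d/ is the safer way to write the pattern.

```ruby
# Keep only strings that look like names: not "N/A" and containing no digit.
def names_only(strings)
  strings.reject { |s| s == 'N/A' || s =~ /\d/ }.uniq
end

# Hypothetical sample values; the real page content may be formatted differently.
sample = ['Dan Miragliotta', 'N/A', 'Jan / 24 / 2009', 'Herb Dean', 'Dan Miragliotta']
puts names_only(sample)
# Dan Miragliotta
# Herb Dean
```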

Adding these to a file and removing duplicates is left as an exercise for you, but I'd use some magical combination of File.readlines, sort and uniq followed by a bit of File.open to write the results.
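One possible combination of File.readlines, sort and uniq is sketched below. The helper name merge_names and the file name are hypothetical, and a missing file is assumed to mean "no names saved yet".

```ruby
require 'tmpdir'

# Merge `names` into the file at `path`, keeping it sorted and free of duplicates.
def merge_names(path, names)
  existing = File.exist?(path) ? File.readlines(path).map(&:chomp) : []
  merged = (existing + names).reject(&:empty?).uniq.sort
  File.open(path, 'w') { |f| f.puts(merged) }   # one name per line
  merged
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'referees.txt')   # hypothetical file name
  merge_names(path, ['Yuji Shimada', 'Herb Dean'])
  p merge_names(path, ['Herb Dean', 'Dan Miragliotta'])
  # => ["Dan Miragliotta", "Herb Dean", "Yuji Shimada"]
end
```

Rewriting the whole file on each run keeps it sorted, at the cost of losing the order in which names were first seen.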

answered Oct 11 '12 at 04:08

the Tin Man


I can't give you any reputation points, so I can only thank you for all the help! :) – Mikko Vedru Oct 11 '12 at 04:31

score 0 · Answer 4 · answered Oct 11 '12 at 05:43

Here is the final answer

require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
require 'open-uri'

# Mute log messages
module SimpleCrawler
   class Crawler
      def log(message)
      end
   end
end

n = 0  # Counts how many pages/profiles have been processed
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 150000
# "\\?" keeps the backslash, so the pattern matches a literal "?"
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\\?association/.*$"]

# Load the previously saved names; start with an empty list on the first run
old_referees = File.exist?('referees.txt') ? File.read('referees.txt').split("\n") : []

sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)

  current_referees = doc.search('td[4] .sub_line').map(&:text).uniq - ['N/A']
  new_referees = current_referees - old_referees

  n += 1
  # If new referees were found, print statistics
  unless new_referees.empty?
    puts n.to_s + ". " + new_referees.length.to_s + " new : " + new_referees.to_s
  end

  old_referees = (new_referees + old_referees).uniq
  old_referees.reject!(&:empty?)

  # Performance optimization: save only after every 10th profile
  if n % 10 == 0
    File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }
  end
}
# Final save, so names found since the last periodic save are kept
File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }

Did you really just unaccept my answer? That's not a good way to get people to help you in the future. – pguardiario, Oct 18 '12 at 14:03
I am new to this site and don't know how to behave properly. Sorry. :( You gave the best answer of all and I accepted it. Then, based on your (and others') answers, I updated the code and posted it here for other people to see. Since that corrected code was the final answer, I clicked the "accept as answer" button again. I didn't know that this would deselect the previously selected answer (yours). Strange that this site doesn't offer an option to select several correct answers. Thanks for telling me. Since I have no other choice, I will gladly re-accept your answer! – Mikko Vedru, Oct 19 '12 at 16:44
