I just started with Ruby on Rails and want to create a simple website crawler that:
- Goes through all the Sherdog fighters' profiles.
- Gets the Referees' names.
- Compares them with the names already collected (both those found earlier in the same run and those saved in the file).
- Prints all the unique names and saves them to a file.
An example URL is: http://www.sherdog.com/fighter/Fedor-Emelianenko-1500
I am looking for tags such as <span class="sub_line">Dan Miragliotta</span>. Unfortunately, in addition to the referee names I need, the same class is also used for:
- The date.
- "N/A" when the referee name is not known.
I need to discard every result that is the string "N/A" as well as any string that contains digits. I managed to do the first part but couldn't figure out the second. After a lot of searching, experimenting and rewriting, I managed to break the whole program and don't know how to fix it properly:
require 'rubygems'
require 'hpricot'
require 'simplecrawler'
# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]
# The crawler yields a Document object for each visited page.
sc.crawl { |document|
  # Parse page title with Hpricot and print it
  hdoc = Hpricot(document.data)
  (hdoc/"td/span[@class='sub_line']").each do |span|
    if span.inner_html == 'N/A' || Regexp.new(".*/\d\.*$").match(span.inner_html)
      # puts "Test"
    else
      puts span.inner_html
      #File.open("File_name.txt", 'a') {|f| f.puts(hdoc.span.inner_html) }
    end
  end
}
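For reference, the filter I am after could be sketched roughly like this, leaving the crawler out. The helper name `referee_name?` and the sample strings (including the date format) are made up for illustration and are not taken from the real page. The only assumption is that a name never contains a digit, so a plain `/\d/` match anywhere in the string is enough to reject dates.

```ruby
# Hypothetical helper: true when the text looks like a referee name,
# i.e. it is not the "N/A" placeholder and contains no digits.
def referee_name?(text)
  text = text.strip
  return false if text.empty? || text == 'N/A'
  !text.match?(/\d/)
end

# Sample values of the kind found in the sub_line spans (made up for illustration).
samples = ['Dan Miragliotta', 'N/A', 'Jun / 21 / 2012', ' Herb Dean ']

# Keep only the strings that pass the filter and trim surrounding whitespace.
names = samples.select { |s| referee_name?(s) }.map(&:strip)
puts names
```

In the crawler loop, the same check would presumably be applied to each span's text instead of to the sample array.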
I would also appreciate ideas for the rest of the program: how do I read the existing names from the file when the program is run multiple times, and how do I compare against them so that only unique names are kept?
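To make the question concrete, here is a rough sketch of the file handling I have in mind, assuming one name per line and using Ruby's standard `Set` for the uniqueness check. The helper names `load_names` and `save_new_names` and the temporary file path are hypothetical, and the candidate names are sample data rather than crawler output.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper: loads previously saved names (one per line) into a Set.
# Returns an empty Set when the file does not exist yet (first run).
def load_names(path)
  return Set.new unless File.exist?(path)
  Set.new(File.readlines(path, chomp: true).map(&:strip).reject(&:empty?))
end

# Hypothetical helper: appends only the names not already in `known`,
# adds them to `known`, and returns the list of names that were new.
def save_new_names(path, known, candidates)
  fresh = candidates.map(&:strip).uniq.reject { |n| known.include?(n) }
  File.open(path, 'a') { |f| fresh.each { |n| f.puts(n) } } unless fresh.empty?
  known.merge(fresh)
  fresh
end

# Usage with a throwaway file; the file name is an assumption.
path = File.join(Dir.tmpdir, "referees_demo_#{Process.pid}.txt")
File.delete(path) if File.exist?(path)

# First "run": the file does not exist, so every distinct name is new.
known = load_names(path)
first = save_new_names(path, known, ['Dan Miragliotta', 'Herb Dean', 'Herb Dean'])

# Second "run": names are reloaded from the file, so only unseen ones are added.
second = save_new_names(path, load_names(path), ['Herb Dean', 'John McCarthy'])

puts first
puts second
```

Reloading the file at the start of each run and keeping the names in a `Set` would cover both comparisons: duplicates within one run and duplicates against earlier runs.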
Edit:
After applying some of the proposed improvements, here is what I have now:
require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
#require 'open-uri'
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  names = doc.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
  puts names
}
Unfortunately, the code still doesn't work: it prints nothing.
If I write doc = Nokogiri::HTML(open(document.data)) instead of doc = Nokogiri::HTML(document.data), it gives me the whole page, but the parsing still doesn't work.