I'm an amateur programmer trying to scrape data from a site similar to this one: http://www.highschoolsports.net/massey/ (I have permission to scrape the site, by the way.)
The target site gives each 'th' in row[0] a class, and I want every 'td' I pull from a table to be linked to the class name of the 'th' above it, because the column order is inconsistent between tables. For example, one table might be:
row[0] -> th.name, th.place, th.team
row[1] -> td[0], td[1], td[2]
while another might be:
row[0] -> th.place, th.team, th.name
row[1] -> td[0], td[1], td[2]
and so on.
My question: how do I capture the 'th' class names across many hundreds of tables whose 'th' class order is inconsistent, create the 10-14 variables (arrays), and then link each 'td' to the variable for its column? There are multiple tables on a given page. Please let me know if this is confusing.
Currently my code is something like:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri' # needed to open the URLs below
require 'uri'

# One row of a results table; for now it only keeps the row's raw text.
class Result
  def initialize(row)
    @attrs = {}
    @attrs[:raw] = row.text
  end
end

# One results table on an event page.
class Race
  def initialize(page, table)
    @handle = page
    @table = table
    @results = []
    @attrs = {}
    parse!
  end

  def parse!
    @attrs[:name] = @handle.css('div.caption').text
    get_rows
  end

  def get_rows
    # get all of the rows in this race's table ..
    @table.css('tr').each do |tr|
      @results << Result.new(tr)
    end
  end
end

# One event page, which either holds result tables or links to sub-events.
class Event
  class << self
    def all(index_url)
      events = []
      ourl = Nokogiri::HTML(URI.open(index_url))
      ourl.css('a.event').each do |url|
        # MAIN (and BASE_URL below) are defined elsewhere, not shown here
        abs_url = MAIN + url['href']
        events << Event.new(abs_url)
      end
      events
    end
  end

  def initialize(url)
    @url = url
    @handle = nil
    @attrs = {}
    @races = []
    @sub_events = []
    parse!
  end

  def parse!
    @handle = Nokogiri::HTML(URI.open(@url))
    get_page_meta
    if @handle.css('table.base.event_results').length > 0
      @handle.search('div.table_container.event_results').each do |table|
        @races << Race.new(@handle, table)
      end
    else
      @handle.css('div.centered a.obvious').each do |ol|
        @sub_events << Event.new(BASE_URL + ol['href'])
      end
    end
  end

  def get_page_meta
    @attrs[:name] = @handle.css('html body div.content h2 text()')[0] # event name
    @attrs[:date] = @handle.xpath("html/body/div/div/text()[2]").text.strip # date
  end
end
A friend has been helping me with this, and I'm just starting to get a grasp on OOP, but so far I'm only capturing the tables; the rows aren't split into 'td's and stored in any kind of variable, array, or hash. I need help understanding how to do this. The critical piece is dynamically assigning variable names according to the classes of the header cells and moving the 'td's from each column (all the td[2]'s, for example) into the matching variable. I can't tell you how amazing it would be if someone could help me solve this problem and understand how to make it work. Thank you in advance for any help!
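As an illustration of the "dynamic variable" part, one possible shape (an assumption on my side, not something my code does yet) is a single hash of arrays keyed by header class, instead of truly dynamic variable names. The `rows` data below is invented and stands in for cells that have already been paired with their header classes.

```ruby
# Invented row data, as it might look after pairing each cell with its header class.
rows = [
  { name: 'Ann', place: '1', team: 'Red' },
  { place: '2', team: 'Blue', name: 'Bob' }
]

# One array per header class, created the first time that class is seen.
columns = Hash.new { |hash, key| hash[key] = [] }

rows.each do |row|
  row.each { |key, value| columns[key] << value }
end

name_column = columns[:name] # every value that sat under a th.name header
```

Because the key is the class name rather than the column position, it would not matter whether the name was td[0] in one table and td[2] in another.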