I'm an amateur programmer wanting to scrape data from a site that is similar to this site: http://www.highschoolsports.net/massey/ (I have permission to scrape the site, by the way.)

The target site gives each 'th' in row[0] a class, and I want to ensure that each 'td' I pull from a table is linked to the class name of its column's 'th', because the tables are inconsistent. For example, one table might be:

row[0] - >>th.name, th.place, th.team

row[1] - >>td[0], td[1] , td[2]

while another might be:

row[0] - >>th.place, th.team, th.name

row[1] - >>td[0], td[1] , td[2] etc..

My question: How do I capture the 'th' class names across many hundreds of tables whose 'th' class order is inconsistent, create the 10-14 variables (arrays), and then link each 'td' in a column to the matching dynamic variable? Please let me know if this is confusing. There are multiple tables on a given page.

Currently my code is something like:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri' # needed to open URLs when fetching pages below

class Result

  def initialize(row)
    @attrs = {}
    @attrs[:raw] = row.text
  end

end

class Race

  def initialize(page, table)
    @handle = page
    @table = table
    @results = []
    @attrs = {}
    parse!
  end

  def parse!
    @attrs[:name] = @handle.css('div.caption').text
    get_rows

  end

  def get_rows
    # Collect one Result per row of this race's own table, not the whole page
    @table.css('tr').each do |tr|
      @results << Result.new(tr)
    end
  end

end

class Event

  class << self

    def all(index_url)
      events = []
      ourl = Nokogiri::HTML(URI.open(index_url))
      ourl.css('a.event').each do |url|
        abs_url = MAIN + url.attributes["href"]
        events << Event.new(abs_url)
      end
      events
    end

  end

  def initialize(url)
    @url = url
    @handle = nil
    @attrs = {}
    @races = []
    @sub_events = []
    parse!
  end

  def parse!
    @handle = Nokogiri::HTML(URI.open(@url))
    get_page_meta
    if(@handle.css('table.base.event_results').length > 0)
      @handle.search('div.table_container.event_results').each do |table|
        @races << Race.new(@handle, table)
      end
    else
      @handle.css('div.centered a.obvious').each do |ol|
        @sub_events << Event.new(BASE_URL + ol.attributes["href"])
      end
    end
  end

  def get_page_meta
    @attrs[:name] = @handle.css('html body div.content h2 text()')[0] # event name
    @attrs[:date] = @handle.xpath("html/body/div/div/text()[2]").text.strip #date
  end

end

A friend has been helping me with this and I'm just starting to get a grasp on OOP, but I'm only capturing the tables; they're not split into td's and stored in some kind of variable, array, or hash. I need help understanding this process or how to do it. The critical piece would be dynamically assigning variable names according to the classes of the data and moving the td's from each column (all td[2]'s, for example) into that dynamic variable name. I can't tell you how amazing it would be if someone could help me solve this problem and understand how to make this work. Thank you in advance for any help!

1 Answer

It's easy once you realize that the th contents are the keys of your hash. Example:

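@items = []
doc.css('table.masseyStyleTable').each do |table|
    # The header texts become the hash keys, in this table's own column order
    fields = table.css('th').map { |th| th.text.strip }
    table.css('tr').each do |tr|
        cells = tr.css('td')
        next if cells.empty? # skip the header row, which has no td cells
        item = {}
        fields.each_with_index do |field, i|
            # A short row leaves the missing columns as empty strings
            item[field] = cells[i] ? cells[i].text.strip : ''
        end
        @items << item
    end
end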
pguardiario
  • pguardiario, you're fantastic! I came up with the following myself: ` def get_results rows = [] rows = @table.css('tr') header = rows.shift puts @attrs[:theads] = header.text rows = rows.map do |row| _row = {} header.each_with_index do |h, i| end _row end end` I'm not sure how to view this now, maybe $log.info "{#@items[item]}"? How would you unload this into DataMapper? I really appreciate it, you're great! Thanks again! – user1010100 Oct 27 '11 at 08:39
  • I'm not sure about DM but in ActiveRecord as long as the fields match your keys you would just do MyTable.new(item).save - Don't forget to accept my answer – pguardiario Oct 27 '11 at 09:18
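
For the keys to match column names, as the last comment requires, the raw header texts usually need normalizing first. This is a minimal sketch in plain Ruby; the sample row, the `column_key` helper, and the resulting column names are hypothetical, and it assumes Ruby 2.5 or later for `Hash#transform_keys`.

```ruby
# Hypothetical row as produced by the answer's loop: keys are raw header texts.
item = { 'Name' => 'Ann', 'Place' => '1', 'Team Name' => 'Red' }

# Turn a header text into a snake_case symbol that could serve as a column name.
def column_key(header)
  header.strip.downcase.gsub(/[^a-z0-9]+/, '_').gsub(/\A_+|_+\z/, '').to_sym
end

record_attrs = item.transform_keys { |header| column_key(header) }

# Printing the hash is a quick way to view a row while debugging.
puts record_attrs.inspect
```

A hash in this shape is what would then be passed to something like `MyTable.new(record_attrs).save`, provided the model really has columns with those names.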