I'm attempting to create a simple web scraper, but I'm having some trouble.

The structure of the website is like this:

<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-01-18">Sun 01-18-15 09:10 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
    <td><a href="/facilities/22/teams/210190">TIGERS</a></td>
</tr>
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-01-25">Sun 01-25-15 06:40 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208345">LIONS</a></td>
    <td><a href="/facilities/22/teams/208362">CYCLONES</a></td>
</tr>
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-02-01">Sun 02-01-15 12:50 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
    <td><a href="/facilities/22/teams/210041">CLAY</a></td>
</tr>

What I currently have is this:

games = page.css("td[class='gametime']").map{|game| game.parent.css("a").text}

This returns an array of three strings (one per row in this example). What I am trying to get instead is a 2D array, where, for example:

games[0][0] #=> Sun 01-18-15 09:10 PM
games[0][1] #=> CYCLONES
games[0][2] #=> TIGERS

I do not want this, which is what I currently get:

games[0] #=> Sun 01-18-15 09:10 PMCYCLONESTIGERS

What is the best approach to achieve this?

the Tin Man
AXG1010

3 Answers

You were close:

games = page.css("td.gametime").map { |i| i.parent.css("a").map { |j| j.text } }

For each td.gametime, go to its parent <tr>, grab all of the <a> tags inside it, then map them to their text. This gives you an array of three values for each game, and an array of arrays for the page.

the Tin Man
Tamer Shlash

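Calling text on the set of nodes returned by css won't give you an array; it concatenates the text of every node into a single string. You'll need to nest the map calls so that text is called on each <a> individually: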

games = page.css("td[class='gametime']").map{|game| game.parent.css("a").map(&:text)}
the Tin Man

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-01-18">Sun 01-18-15 09:10 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
    <td><a href="/facilities/22/teams/210190">TIGERS</a></td>
</tr>
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-01-25">Sun 01-25-15 06:40 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208345">LIONS</a></td>
    <td><a href="/facilities/22/teams/208362">CYCLONES</a></td>
</tr>
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-02-01">Sun 02-01-15 12:50 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
    <td><a href="/facilities/22/teams/210041">CLAY</a></td>
</tr>
EOT

Here's the code:

games = doc.search('tr').map{ |tr| tr.search('td').map(&:text) }
# => [["Sun 01-18-15 09:10 PM", "CYCLONES", "TIGERS"],
#     ["Sun 01-25-15 06:40 PM", "LIONS", "CYCLONES"],
#     ["Sun 02-01-15 12:50 PM", "CYCLONES", "CLAY"]]
games[0][0] # => "Sun 01-18-15 09:10 PM"
games[0][1] # => "CYCLONES"
games[0][2] # => "TIGERS"

It isn't necessary to grab the inner tags inside the <td> tags for this HTML. Sometimes a cell contains additional text that should be ignored, which would make drilling down to the inner tag necessary, but since this markup is simple, the code can be simple. Calling text on a <td> node returns the combined text of all the text nodes nested inside it, including those inside the <a>.

I seriously doubt the HTML they're serving is that plain, and without a bit more detail I can't give a more accurate answer. (It's in your interest to provide detailed and accurate input.) The general idea, though, is to find the table containing the rows you want, then drill down:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<table class="foo">
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-01-18">Sun 01-18-15 09:10 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
    <td><a href="/facilities/22/teams/210190">TIGERS</a></td>
</tr>
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-01-25">Sun 01-25-15 06:40 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208345">LIONS</a></td>
    <td><a href="/facilities/22/teams/208362">CYCLONES</a></td>
</tr>
<tr>
    <td class="gametime"><a href="/facilities/22/games?exact_date=15-02-01">Sun 02-01-15 12:50 PM</a></td>
    <td class="gamehome"><a href="/facilities/22/teams/208362">CYCLONES</a></td>
    <td><a href="/facilities/22/teams/210041">CLAY</a></td>
</tr>
</table>
<table class="bar">
</table>
EOT

And the modified code:

games = doc.search('table.foo tr').map{ |tr| tr.search('td').map(&:text) }
# => [["Sun 01-18-15 09:10 PM", "CYCLONES", "TIGERS"],
#     ["Sun 01-25-15 06:40 PM", "LIONS", "CYCLONES"],
#     ["Sun 02-01-15 12:50 PM", "CYCLONES", "CLAY"]]
games[0][0] # => "Sun 01-18-15 09:10 PM"
games[0][1] # => "CYCLONES"
games[0][2] # => "TIGERS"
the Tin Man