Based on the comment:
I think you could be right, I have looked at the HTML at the remote webpage and they are adding a wrapping around every table row which is assigned a class. I wonder if there is any way of getting the script to skip empty rows?
If you're seeing an HTML structure like:
<table>
<tbody>
<tr>
<tr>
<td>time</td>
<td>source</td>
<td>destination</td>
<td>duration</td>
</tr>
</tr>
</tbody>
</table>
Then this will show the problem:
require 'nokogiri'
require 'pp'
html = '<table><tbody><tr><tr><td>time</td><td>source</td><td>destination</td><td>duration</td></tr></tr></tbody></table>'
doc = Nokogiri::HTML(html)
page = doc.search("table tbody tr").each do |row|
time = row.css("td:nth-child(1)").text.strip
source = row.css("td:nth-child(2)").text.strip
destination = row.css("td:nth-child(3)").text.strip
duration = row.css("td:nth-child(4)").text.strip
hash = {
:time => time,
:source => source,
:destination => destination,
:duration => duration
}
pp hash
end
That outputs:
{:time=>"", :source=>"", :destination=>"", :duration=>""}
{:time=>"time",
:source=>"source",
:destination=>"destination",
:duration=>"duration"}
The reason you are getting the blank rows is because the HTML is malformed. The outside <tr>
shouldn't be there. The fix is easy and will work with HTML that is correct also.
Also, the inner css
access is not quite correct, but why that is so is subtle. I'll get to that.
To fix the first, we'll add a conditional test:
page = doc.search("table tbody tr").each do |row|
becomes:
page = doc.search("table tbody tr").each do |row|
next if (!row.at('td'))
After running, the output is now:
{:time=>"time",
:source=>"source",
:destination=>"destination",
:duration=>"duration"}
That's really all you need to fix the problem, but there are some things in the code that are doing things the hard way which requires some 'splainin', but first here's the code change:
From:
time = row.css("td:nth-child(1)").text.strip
source = row.css("td:nth-child(2)").text.strip
destination = row.css("td:nth-child(3)").text.strip
duration = row.css("td:nth-child(4)").text.strip
Change to:
time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
Running that code outputs what you want:
{:time=>"time",
:source=>"source",
:destination=>"destination",
:duration=>"duration"}
so things are hunky-dory still.
Here's the problem with your original code:
css
is an alias to search
. Nokogiri returns a NodeSet for both. text
will return an empty string from an empty NodeSet, which you'd get for each of the row.css("td:nth-child(...)").text.strip
calls that looked at the outer <tr>
. So, Nokogiri was failing to do what you wanted silently, because it was syntactically correct and logically correct given what you told it to do; It just failed to meet your expectations.
Using at
, or one of its aliases, like css_at
, looks for the first matching accessor. So, theoretically we could continue to use row.at("td:nth-child(1)").text.strip
with multiple assignments for each accessor, and that would have immediately revealed you had a problem with the HTML because the text
would have blown up. But that's not zen-like enough.
Instead, we can iterate over the cells returned in the NodeSet using map
and let it gather the needed cell contents and strip them, then do a parallel assignment to the variables:
time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
Again, running this:
require 'nokogiri'
require 'pp'
html = '<table><tbody><tr><tr><td>time</td><td>source</td><td>destination</td><td>duration</td></tr></tr></tbody></table>'
doc = Nokogiri::HTML(html)
page = doc.search("table tbody tr").each do |row|
next if (!row.at('td'))
time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
hash = {
:time => time,
:source => source,
:destination => destination,
:duration => duration
}
pp hash
end
Gives me:
{:time=>"time",
:source=>"source",
:destination=>"destination",
:duration=>"duration"}
Retrofit that into your code and you get:
page = agent.page.search("table tbody tr").each do |row|
next if (!row.at('td'))
time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
Call.create!(:time => time, :source => source, :destination => destination, :duration => duration)
end
And you probably don't need the page =
.