0

I have the following code, thanks to another SO question/answer:

page = agent.page.search("table tbody tr").each do |row|
  time        = row.css("td:nth-child(1)").text.strip
  source      = row.css("td:nth-child(2)").text.strip
  destination = row.css("td:nth-child(3)").text.strip
  duration    = row.css("td:nth-child(4)").text.strip
  Call.create!(:time => time, :source => source, :destination => destination, :duration => duration)
end

It's working well and when I run the rake task it correctly puts the data into the correct table row in my Rails application, however, for some reason after successfully creating a record for a row it's also creating a blank record.

I can't figure it out. From the looks of the code it's issuing the create! command within each row.

You can see the full rake task at https://gist.github.com/1574942 and the other question leading to this code is "Parse html into Rails without new record every time?".

Community
  • 1
  • 1
dannymcc
  • 3,744
  • 12
  • 52
  • 85
  • I'd suspect that there is something in the HTML that is causing intermediate loops, but not populating the values, such as additional rows with no content. – the Tin Man Jan 12 '12 at 01:42
  • I think you could be right, I have looked at the HTML at the remote webpage and they are adding a wrapping around every table row which is assigned a class. I wonder if there is any way of getting the script to skip empty rows? – dannymcc Jan 12 '12 at 14:40
  • Add a sample of the HTML you are seeing, containing two rows and the cells. We can nail the problem then. Without that, we're guessing. – the Tin Man Jan 12 '12 at 16:41
  • Belay that. See my answer below. – the Tin Man Jan 12 '12 at 17:32

1 Answers1

1

Based on the comment:

I think you could be right, I have looked at the HTML at the remote webpage and they are adding a wrapping around every table row which is assigned a class. I wonder if there is any way of getting the script to skip empty rows?

If you're seeing an HTML structure like:

<table>
  <tbody>
    <tr>
      <tr>
        <td>time</td>
        <td>source</td>
        <td>destination</td>
        <td>duration</td>
      </tr>
    </tr>
  </tbody>
</table>

Then this will show the problem:

require 'nokogiri'
require 'pp'

html = '<table><tbody><tr><tr><td>time</td><td>source</td><td>destination</td><td>duration</td></tr></tr></tbody></table>'
doc = Nokogiri::HTML(html)
page = doc.search("table tbody tr").each do |row|
  time        = row.css("td:nth-child(1)").text.strip
  source      = row.css("td:nth-child(2)").text.strip
  destination = row.css("td:nth-child(3)").text.strip
  duration    = row.css("td:nth-child(4)").text.strip
  hash = {
    :time        => time,
    :source      => source,
    :destination => destination,
    :duration    => duration 
  }
  pp hash
end

That outputs:

{:time=>"", :source=>"", :destination=>"", :duration=>""}
{:time=>"time",
 :source=>"source",
 :destination=>"destination",
 :duration=>"duration"}

The reason you are getting the blank rows is because the HTML is malformed. The outside <tr> shouldn't be there. The fix is easy and will work with HTML that is correct also.

Also, the inner css access is not quite correct, but why that is so is subtle. I'll get to that.

To fix the first, we'll add a conditional test:

page = doc.search("table tbody tr").each do |row|

becomes:

page = doc.search("table tbody tr").each do |row|
  next if (!row.at('td'))

After running, the output is now:

{:time=>"time",
 :source=>"source",
 :destination=>"destination",
 :duration=>"duration"}

That's really all you need to fix the problem, but there are some things in the code that are doing things the hard way which requires some 'splainin', but first here's the code change:

From:

time        = row.css("td:nth-child(1)").text.strip
source      = row.css("td:nth-child(2)").text.strip
destination = row.css("td:nth-child(3)").text.strip
duration    = row.css("td:nth-child(4)").text.strip

Change to:

time, source, destination, duration = row.search('td').map{ |td| td.text.strip }

Running that code outputs what you want:

{:time=>"time",
 :source=>"source",
 :destination=>"destination",
 :duration=>"duration"}

so things are hunky-dory still.

Here's the problem with your original code:

css is an alias to search. Nokogiri returns a NodeSet for both. text will return an empty string from an empty NodeSet, which you'd get for each of the row.css("td:nth-child(...)").text.strip calls that looked at the outer <tr>. So, Nokogiri was failing to do what you wanted silently, because it was syntactically correct and logically correct given what you told it to do; It just failed to meet your expectations.

Using at, or one of its aliases, like css_at, looks for the first matching accessor. So, theoretically we could continue to use row.at("td:nth-child(1)").text.strip with multiple assignments for each accessor, and that would have immediately revealed you had a problem with the HTML because the text would have blown up. But that's not zen-like enough.

Instead, we can iterate over the cells returned in the NodeSet using map and let it gather the needed cell contents and strip them, then do a parallel assignment to the variables:

time, source, destination, duration = row.search('td').map{ |td| td.text.strip }

Again, running this:

require 'nokogiri'
require 'pp'

html = '<table><tbody><tr><tr><td>time</td><td>source</td><td>destination</td><td>duration</td></tr></tr></tbody></table>'
doc = Nokogiri::HTML(html)
page = doc.search("table tbody tr").each do |row|
  next if (!row.at('td'))

  time, source, destination, duration = row.search('td').map{ |td| td.text.strip }

  hash = {
    :time        => time,
    :source      => source,
    :destination => destination,
    :duration    => duration 
  }
  pp hash
end

Gives me:

{:time=>"time",
 :source=>"source",
 :destination=>"destination",
 :duration=>"duration"}

Retrofit that into your code and you get:

page = agent.page.search("table tbody tr").each do |row|
  next if (!row.at('td'))
  time, source, destination, duration = row.search('td').map{ |td| td.text.strip }
  Call.create!(:time => time, :source => source, :destination => destination, :duration => duration)
end

And you probably don't need the page =.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303