
I run the following successfully:

require 'nokogiri'
require 'open-uri'

own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')

p own_table.css('tr').css('td')[4].css('a').attr('href').value

=> "/Archives/edgar/data/0001513362/000162828016019444/0001628280-16-019444-index.htm"

However, when I try to use the element above in a block (as shown in code below), I get a NoMethodError for nil:NilClass.

I am confused, because I thought that the local variable link in the block would be the same object as in the code above.

Furthermore, if I change the definition of link below to:

link = row.css('td')[4].class

I get a hash without error, saying the value of link is Nokogiri::XML::Element.

Can anyone explain why I have a Nokogiri::XML::Element object but cannot call the css method on it, especially when the same call works in the first snippet?

require 'nokogiri'
require 'open-uri'

own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')


own_table.css('tr').each do |row|
  names = [:acq, :transaction_date, :execution_date, :issuer, :form, :transaction_type, :direct_or_indirect_ownership, :number_of_securities_transacted, :number_of_securities_owned, :line_number, :issuer_cik, :security_name, :url]
  values = row.css('td').map(&:text)
  link = row.css('td')[4].css('a').attr('href').value
  values << link
  hash = Hash[names.zip values]
  puts hash
end

secown.rb:11:in `block in <main>': undefined method `css' for nil:NilClass (NoMethodError)
    from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
    from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
    from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
    from secown.rb:8:in `<main>'
PiperWarrior
  • Please read "[mcve]". When asking about a problem with code we need the minimum input data (HTML in this case) that demonstrates the problem in the question itself. Don't ask us to go to a site and read through an entire page as it slows down our response time for you and affects our ability to help others. You should never need to chain each tag using `css` or `search`. Instead use more complex selectors that jump from landmark to landmark to the target in the markup. That is less fragile. Also, you should wait longer before selecting an answer. – the Tin Man Sep 28 '16 at 22:03

2 Answers


The crucial insight is that in the first case, own_table.css('tr') returns a NodeSet. Calling .css('td') on that NodeSet finds every td that is a descendant of any node in the set, and [4] then picks the fourth one from that combined list (speaking as a programmer, fifth for normal people :P ).

The second snippet treats each row individually as a Node, finds the td elements that are descendants of that one row, and then picks the fourth one (again counting from zero) within that row only.

So if you have this structure:

tr id=1
  td id=2
  td id=3
tr id=4
  td id=5
  td id=6
  td id=7
  td id=8
  td id=9

then the first snippet will give you the id 7 td, because it is the td at index 4 across all tr elements combined. The second snippet looks for the td at index 4 inside the id 1 tr, then inside the id 4 tr, and it errors out on the first of these because the id 1 tr has only two td elements, so [4] returns nil.

Edit: Specifically, having checked your URL, the first tr has no td; all the others have 12. So own_table.css('tr')[0].css('td')[4].class is NilClass, not Nokogiri::XML::Element as you report.

Amadan
  • I meant changing local variable link in block to link = row.css('td')[4].class gives Nokogiri::XML::Element – PiperWarrior Sep 28 '16 at 02:34
  • I know what you meant. It does for 80 rows (the ones with 12 `td`); not for the first one (the one with no `td`). – Amadan Sep 28 '16 at 02:36
  • @Amadan is right: the first row has `th` cells instead of `td`, which is why you are getting the `NoMethodError`. For that row, `values = row.css('td').map(&:text)` returns `[]`. – retgoat Sep 28 '16 at 02:39
  • @Amadan I ignored the first row by using each_with_index and iterating only when index != 0. That worked. – PiperWarrior Sep 28 '16 at 02:49

Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div><span><p>foo</p></span></div>
    <div id="bar"><span><p>bar</p></span></div>
  </body>
</html>
EOT

If I chain the methods I'm going to find all matching <p> nodes inside the <div>s:

doc.css('div').css('span').css('p').to_html
# => "<p>foo</p><p>bar</p>"

or:

doc.css('div').css('p').to_html
# => "<p>foo</p><p>bar</p>"

That's equivalent to using the following single selectors, which are a bit more efficient because they don't involve calling into libxml2 multiple times:

doc.css('div span p').to_html
# => "<p>foo</p><p>bar</p>"

or:

doc.css('div p').to_html
# => "<p>foo</p><p>bar</p>"

Really you should find landmarks in the target markup and leapfrog from one to the next, not step from tag to tag:

doc.css('#bar p').to_html
# => "<p>bar</p>"

If your intention was to find all matches then replace #bar with div in the above selector and it'll loosen the search.

Finally, if your goal is to extract the text of a set of nodes, you don't want to use something like:

doc.css('div p').text

css, like search and xpath, returns a NodeSet, and text concatenates the text from all returned nodes, making it difficult to retrieve the text of the individual nodes. Instead use:

doc.css('div p').map(&:text)

which will return an array containing the text of each node found:

doc.css('div p').text
# => "foobar"

versus:

doc.css('div p').map(&:text)
# => ["foo", "bar"]

See "How to avoid joining all text from Nodes when scraping" also.

the Tin Man
  • Also, in the code above, doc.css('div').css('span').css('p').to_html does not work as specified (it returns an empty string). If you remove, css('span'), it does. – PiperWarrior Sep 29 '16 at 00:56
  • Very instructive post. The code does look cleaner and does not repeat when rid of chaining. – PiperWarrior Sep 29 '16 at 01:15
  • `If you remove, css('span')`... That's because I forgot to update the example to the version with the `span` tags. I'll update it. – the Tin Man Sep 30 '16 at 16:30