I am moving some of my scraping from JavaScript to Ruby, and I am having trouble using Nokogiri.

I am having trouble selecting the <dl> inside an element with the class "target". I tried both CSS and XPath selectors, with the same result.

This is a sample of the HTML:

<div class="target">
  <dl>
    <dt>A:</dt>
    <dd>foo</dd> 
    <dt>B:</dt>
    <dd>bar</dd>
  </dl>
</div>

This is a sample of my code:

require 'nokogiri'
require 'open-uri'

# Kernel#open no longer accepts URLs as of Ruby 3.0; use URI.open instead.
doc = Nokogiri::HTML(URI.open(url))
doc.css(".target > dl").each do |item|
  puts item.text # I would expect to receive a collection of nodes here,
                  # yet I am receiving a single block of text
end

doc.css(".target > dl > dt").each do |item|
  puts item.text # Here I would expect to iterate through a collection of
                  # dt elements, however I receive a single block of text
end

Can someone show me what I am doing wrong?

When asking, show us what you are receiving and explain why it's wrong. Don't show us only code. Without an example and explanation of why the output is wrong we're left guessing, based on experience of what is useful for us. – the Tin Man Nov 19 '15 at 23:22

2 Answers

In the first case, the selector matches the single dl, so the loop runs once and item.text returns the text of all of its descendants joined into one block. That is expected.

In the second case, the result should be two individual dt elements. You are printing their text one after another, which is indistinguishable from printing the text of the dl.

doc.css('.target > dl').length
# => 1 # as you have one `dl` element in `.target`

doc.css('.target > dl > dt').length
# => 2 # as you have two `dt` elements that are children of a `dl` in `.target`

doc.css(".target > dl > dt").each do |item|
  puts item.text
  puts "---" # make it obvious which element is which
end
# => A:
#    ---
#    B:
#    ---

I am not quite sure what other result you are expecting.

Amadan

I'd use something like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="target">
  <dl>
    <dt>A:</dt>
    <dd>foo</dd> 
    <dt>B:</dt>
    <dd>bar</dd>
  </dl>
</div>
EOT

This finds the first class='target', then its contained <dt> tags, and extracts each <dt>'s text:

doc.at('.target').search('dt').map{ |n| n.text } # => ["A:", "B:"]

This does the same, using the &:text shorthand to call text on each node inside map:

doc.at('.target').search('dt').map(&:text) # => ["A:", "B:"]

This lets the engine find all <dt> embedded in all class="target" tags:

doc.search('.target dt').map(&:text) # => ["A:", "B:"]

See "How to avoid joining all text from Nodes when scraping" also.

the Tin Man