
I am trying to get the text within a <li> element, which I identify with the help of a <span> element that contains "Inhalt:".

The text I want to get is "0,75 l".

This is the HTML code:

<li class="product--tax is-left">
   <span class="label--purchase-unit">Inhalt:</span>
0,75 l #text I want to get
</li>

I've been trying this, but it doesn't seem to work:

doc.search("[text()*='Inhalt:']").parent.xpath('text()')
the Tin Man
AaronDT

1 Answer


You're trying to find the text node that follows the <span>. That's easily done once you know where the <span> is:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)

<li class="product--tax is-left">
   <span class="label--purchase-unit">Inhalt:</span>
0,75 l
</li>
EOT

These are various ways to get there:

doc.at('li span').next_sibling.text.strip # => "0,75 l"
doc.at('li.product--tax span.label--purchase-unit').next_sibling.text.strip # => "0,75 l"
doc.at('.label--purchase-unit').next_sibling.text.strip # => "0,75 l"
doc.at('span.label--purchase-unit').next_sibling.text.strip # => "0,75 l"

Moving on...

doc.search("[text()*='Inhalt:']").parent.xpath('text()')

is a bad way to go about trying to find the node:

  • search returns a NodeSet, which would be all matching nodes in the document. While this particular use might only have one occurrence of "Inhalt:", in another document that has multiple instances of the target word you'd get multiple hits and would get a garbage result.
  • parent isn't a method of NodeSet, so that'd blow up.
  • parent.xpath isn't a good way of continuing a selector. Instead, to accomplish that in XPath you should use something like:

    //*[contains(text(), 'Inhalt:')]/../text()

    .. means "move to the parent of the current node" in XPath-lingo, and contains() is the XPath counterpart of the *= substring match. That's off the top of my head but looks right.


Why do you use at instead of .css or .xpath?

at is equivalent to search('some_selector').first, so it's the shorthand to find the first occurrence of that selector. at and search are generic methods, taking either XPath or CSS, and rely on some heuristics to determine whether the selector is an XPath or a CSS string. They can be fooled but most of the time they're completely safe and more convenient than their xpath, css, at_xpath or at_css variants.

If the markup could have multiple nodes you want to identify then adjust the use of at and search accordingly.

There is a point of confusion we see people fall over regularly. at and the at_* variants return a Node, while search and its xpath and css variants return a NodeSet. Calling text on a NodeSet concatenates the text of every node in the set into one string, which is rarely what you want. Meditate on this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"

doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"

doc.search('p').map(&:text) # => ["foo", "bar"]

This behavior is documented, but people rarely read the documentation and then end up trying to figure out how to recover the text of the individual nodes after it's been joined together.

See "How to avoid joining all text from Nodes when scraping" also.

the Tin Man
  • Awesome! Thank you very much! Why do you use at instead of .css or .xpath? :) – AaronDT Oct 07 '16 at 19:37
  • Glad it helped. The idea on Stack Overflow is we're writing an on-line reference book, so good questions and answers help the rest of the programming world. Not only does it help you, it helps the next person looking for a solution to a similar problem. Keep asking good questions, then start answering them and teaching them what you've learned. – the Tin Man Oct 07 '16 at 19:56