You're trying to find the text node that follows the <span>. That's easily done once you know where the <span> is:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<li class="product--tax is-left">
<span class="label--purchase-unit">Inhalt:</span>
0,75 l
</li>
EOT
These are various ways to get there:
doc.at('li span').next_sibling.text.strip # => "0,75 l"
doc.at('li.product--tax span.label--purchase-unit').next_sibling.text.strip # => "0,75 l"
doc.at('.label--purchase-unit').next_sibling.text.strip # => "0,75 l"
doc.at('span.label--purchase-unit').next_sibling.text.strip # => "0,75 l"
Moving on...
doc.search("[text()*='Inhalt:']").parent.xpath('text()')
is a bad way to go about trying to find the node:
search returns a NodeSet, which would be all matching nodes in the document. While this particular use might only have one occurrence of "Inhalt:", in another document that has multiple instances of the target word you'd get multiple hits and a garbage result.
parent isn't a method of NodeSet, so that'd blow up.
parent.xpath isn't a good way of continuing a selector. Note also that *= is CSS attribute-selector syntax, not XPath. To accomplish the same thing in XPath you should use something like:
//*[contains(text(), 'Inhalt:')]/../text()
.. means move to the parent of the current node in XPath-lingo, and contains() does the substring match.
Why do you use at instead of .css or .xpath?
at is equivalent to search('some_selector').first, so it's the shorthand to find the first occurrence of that selector. at and search are generic methods, taking either XPath or CSS, and rely on some heuristics to determine whether the selector is an XPath or a CSS string. They can be fooled but most of the time they're completely safe and more convenient than their xpath, css, at_xpath or at_css variants.
If the markup could have multiple nodes you want to identify then adjust the use of at and search accordingly.
There is a point of confusion we see people fall over regularly. at and the at_* variants return a Node, while search and its xpath and css variants return a NodeSet. When you try to extract text from the NodeSet that search returns, text will do something unexpected: it concatenates the text of every node in the set into a single string. Meditate on this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"
doc.search('p').map(&:text) # => ["foo", "bar"]
This behavior is documented, but people rarely read the documentation and then have to figure out how to recover the text of the two nodes after it's been mangled.
See "How to avoid joining all text from Nodes when scraping" also.