Nokogiri supports two main types of searches, `search` and `at`. `search` returns a NodeSet, which you should think of like an array. `at` returns a Node. Either can take a CSS or XPath expression. I prefer CSS selectors since they're more readable, but sometimes you can't easily get where you want to be with one, so try the other.
For your question, it's important to specify the node you want to extract the text from, using `text`. If your search is too broad you'll get text from between tags in addition to the text inside the tag you want. To avoid that, drill down to the node closest to the text you're trying to read:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<release>
<artists>
<artist>
<name>Johnny Mnemonic</name>
</artist>
<artist>
<name>Constantine</name>
</artist>
</artists>
</release>
EOT
Because these look for the `name` node specifically, the desired text is easy to get without garbage:
doc.at('name').text # => "Johnny Mnemonic"
doc.at('artist name').text # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"
These are looser searches so more junk is returned:
doc.at('artist').text # => "\n Johnny Mnemonic\n "
doc.at('artists').text # => "\n \n Johnny Mnemonic\n \n \n Constantine\n \n \n\n"
Using `search` returns multiple nodes:
doc.search('name').map(&:text)
[
[0] "Johnny Mnemonic",
[1] "Constantine"
]
doc.search('artist').map(&:text)
[
[0] "\n Johnny Mnemonic\n ",
[1] "\n Constantine\n "
]
The only real difference between `search` and `at` is that `at` is like `search(...).first`.
See "How to avoid joining all text from Nodes when scraping" also.
Nokogiri has some additional aliases for convenience: `at_css` and `css`, and `at_xpath` and `xpath`.
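Here are alternate ways to get at the names, using CSS and XPath accessors. This is clipped from a Pry session run against a different document than the sample above, which is why "Speed" appears in the output: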
[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]