
I have data that looks like:

<release> 
 <artists>
  <artist>
   <name>Johnny Mnemonic</name>
  </artist>
  <artist>
    <name>Constantine</name>
  </artist>
 </artists>
</release>
<release>
 <artists>
  <artist>
   <name>Speed</name>
  </artist>
  <artist>
    <name>The Matrix</name>
  </artist>
 </artists>
 </release>
 ...and so on.

For each release I want only the data from the first <artist> tag. I tried the following code but it pulls all text from the artists:

page = Nokogiri::XML(open("37.xml"))

page.xpath("//artists[1]").each do |el|
  # Appends the whole <artists> node, so every artist ends up in the file
  File.open("#{LOCAL_DIR}/37.txt", 'a') { |f| f.write(el) }
end

the Tin Man
user1596069

2 Answers


Nokogiri supports two main types of searches, search and at. search returns a NodeSet, which you should think of as an array. at returns a single Node. Either can take a CSS or XPath expression. I prefer CSS selectors since they're more readable, but sometimes you can't easily get where you want to be with one, so try the other.

For your question, it's important to specify the node you want to extract the text from, using text. If your search is too broad you'll get the whitespace and text between tags in addition to the text inside the tag you want. To avoid that, drill down to the node closest to the text you're trying to read:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<release> 
<artists>
  <artist>
  <name>Johnny Mnemonic</name>
  </artist>
  <artist>
    <name>Constantine</name>
  </artist>
</artists>
</release>
EOT

Because these look for the name node specifically, the text desired is easy to get without garbage:

doc.at('name').text                # => "Johnny Mnemonic"
doc.at('artist name').text         # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"

These are looser searches so more junk is returned:

doc.at('artist').text  # => "\n   Johnny Mnemonic\n  "
doc.at('artists').text # => "\n  \n   Johnny Mnemonic\n  \n  \n    Constantine\n  \n \n\n"

Using search returns multiple nodes:

doc.search('name').map(&:text)

[
    [0] "Johnny Mnemonic",
    [1] "Constantine"
]

doc.search('artist').map(&:text)

[
    [0] "\n   Johnny Mnemonic\n  ",
    [1] "\n    Constantine\n  "
]

The only real difference between search and at is that at is like search(...).first.

See "How to avoid joining all text from Nodes when scraping" also.

Nokogiri has some additional aliases for convenience: at_css and css, and at_xpath and xpath.


Here are alternate ways, using CSS and XPath accessors to get at the names, clipped from Pry:

[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
the Tin Man
  • Thank you very much. The doc.at('name') seems to be what I want. One more question, could you show me how to repeat that over every node? – user1596069 Mar 19 '13 at 18:55
  • Ah. You didn't give us an accurate example of your data. `'name'` doesn't respect any of the containing nodes. Is `<release>` significant, causing you to break and do something special for each one? You should be able to figure that out based on the information I gave you. – the Tin Man Mar 19 '13 at 19:38
  • Sorry. Release is the first node with all of the rest as children under it. I wanted to pull the first 'name' data from each release, of which there are probably 10,000. So your code pulled the first name from the first release node, I want that to repeat over every <release>, structured in an identical way to the first. – user1596069 Mar 19 '13 at 21:41
  • Again, your sample data needs to _show_ that. Please add an accurate example. Reduce it but it needs to show what you'll be working with. The fix to the code is easy but I want to make a change once, not again and again as more changes to the source data are revealed. – the Tin Man Mar 19 '13 at 23:02
  • Sorry, the code is edited. This is an example of two releases, of which there will be thousands. I want to pull the first 'name' data from each of the nodes. – user1596069 Mar 19 '13 at 23:27
  • I added some various ways of doing it. – the Tin Man Mar 21 '13 at 03:59
  • @theTinMan this is pretty much the best documentation/tutorial for this issue that is out there. Thank you. – MrVocabulary Feb 26 '20 at 16:49

Your XPath expression selects the <artists> element, not each <artist> tag as you seem to expect. Try this:

doc.search('artists artist').map(&:text)

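Your expression "//artists[1]" matches each 'artists' element that is the first 'artists' child of its parent, so the [1] picks among the 'artists' tags, not the first 'artist' element inside them. The text of that node therefore contains every artist it holds.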

ichigolas