I am trying to search for a specific node in an XML file using XPath. This search worked fine under REXML, but REXML was too slow for large XML documents, so I moved over to LibXML.

My simple example processes a Yum repomd.xml file; an example can be found here: http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml

My test script is as follows:

require 'rubygems'
require 'libxml'

p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse

filelist = repomd.find_first("/repomd/data[@type='filelists']/location@href")
puts "Length: " + filelist.length.to_s
filelist.each do |f|
   puts f.attributes['href']
end

I get this error:

Error: Invalid expression.
/usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find': Error: Invalid expression. (LibXML::XML::Error)
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find'
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:130:in `find_first'
from /tmp/scripty.rb:6

I have also tried simpler examples like below, but still no dice.

p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.root.find(".//location")
puts "Length: " + filelist.length.to_s

In the above case I get the output:

Length: 0

Any guidance would be greatly appreciated. I have searched for what I am doing wrong and just can't figure it out.

Here is some code that fetches the file and processes it. It still doesn't work:

require 'rubygems'
require 'open-uri'
require 'libxml'

raw_xml = open('http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml').read
p = LibXML::XML::Parser.string(raw_xml)
repomd = p.parse
filelist = repomd.find_first("//data[@type='filelists']/location[@href]")
puts "First: " + filelist

2 Answers

1

In the end I went back to REXML and used stream processing. It was much faster, and REXML's XPath syntax was much easier to work with.
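For readers unfamiliar with REXML's stream mode, here is a minimal sketch of what such stream processing might look like. It is not the code actually used here: the `LocationListener` class, the `filelists_hrefs` helper, and the inline `SAMPLE_REPOMD` document (a cut-down stand-in for a real repomd.xml) are all hypothetical names introduced for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical, cut-down sample modelled on a Yum repomd.xml file.
SAMPLE_REPOMD = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <repomd xmlns="http://linux.duke.edu/metadata/repo">
    <data type="filelists">
      <location href="repodata/filelists.xml.gz"/>
    </data>
    <data type="primary">
      <location href="repodata/primary.xml.gz"/>
    </data>
  </repomd>
XML

# Collects the href of every <location> nested in a <data> element
# whose type attribute matches the wanted type.
class LocationListener
  include REXML::StreamListener

  attr_reader :hrefs

  def initialize(wanted_type)
    @wanted_type = wanted_type
    @in_wanted_data = false
    @hrefs = []
  end

  # Called for each opening tag; attrs is a Hash of attribute name => value.
  def tag_start(name, attrs)
    case name
    when 'data'
      @in_wanted_data = (attrs['type'] == @wanted_type)
    when 'location'
      @hrefs << attrs['href'] if @in_wanted_data
    end
  end

  # Leaving a <data> element resets the flag.
  def tag_end(name)
    @in_wanted_data = false if name == 'data'
  end
end

# Streams the XML through the listener without building a full document tree.
def filelists_hrefs(xml)
  listener = LocationListener.new('filelists')
  REXML::Document.parse_stream(xml, listener)
  listener.hrefs
end

puts filelists_hrefs(SAMPLE_REPOMD)
```

Because the listener only reacts to tag events, the whole document never has to be held in memory, which is probably where most of the speed-up on large files comes from.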

0

Looking at your code, it seems you want to collect only those location elements that have an href attribute. If that's the case, the expression below should work:

"//data[@type='filelists']/location[@href]"
Arup Rakshit
  • Unfortunately not, any further suggestions? The search is returning 'nil' – MediumDaveR Aug 20 '13 at 10:18
  • still nil :-( filelist = repomd.find_first("//data[contains(@type,'filelists')]/location[@href]") puts "First: " + filelist.string – MediumDaveR Aug 20 '13 at 10:28
  • @MediumDaveR Okay.. that means *Error: Invalid expression* is gone now..So hope you understand that your `xpath` expression was incorrect..:) Can you show me the output of `puts filelist` ? – Arup Rakshit Aug 20 '13 at 10:30
  • understand it is invalid for libxml (was OK for rexml and does seem to comply with standards). Code is now: filelist = repomd.find_first("//data[contains(@type,'filelists')]/location[@href]") puts "First: " + filelist Output is: /tmp/scripty.rb:7:in `+': can't convert nil into String (TypeError) from /tmp/scripty.rb:7 – MediumDaveR Aug 20 '13 at 10:35
  • @MediumDaveR why you are not trying [`nokogiri`](http://nokogiri.org/)? It is best.. – Arup Rakshit Aug 20 '13 at 10:36
  • that was going to be my next port of call, but convinced libxml should still work. Just updated the original post with another example script that fetches the file, perhaps you can work with that to figure out the problem? I certainly can't! – MediumDaveR Aug 20 '13 at 10:43
  • it's returning a LibXML::XML::Document – MediumDaveR Aug 20 '13 at 10:54
  • @MediumDaveR OK. Let's wait for someone who uses this library; if nobody turns up, I will give it a try when I am home. How did you install it? I will do the same on my laptop, as I have never actually used this library. But I would bet that the `xpath` expression is OK and the problem is somewhere else. – Arup Rakshit Aug 20 '13 at 11:06