I have an XML file that I need to parse. I have no control over the format of the file and cannot change it.

The file uses a prefix (call it a) but never declares a namespace for that prefix. As a result, I can't use XPath to query for nodes that carry the a prefix.

Here are the contents of the XML document:

<?xml version="1.0" encoding="UTF-8"?>

<a:root>
  <a:thing>stuff0</a:thing>
  <a:thing>stuff1</a:thing>
  <a:thing>stuff2</a:thing>
  <a:thing>stuff3</a:thing>
  <a:thing>stuff4</a:thing>
  <a:thing>stuff5</a:thing>
  <a:thing>stuff6</a:thing>
  <a:thing>stuff7</a:thing>
  <a:thing>stuff8</a:thing>
  <a:thing>stuff9</a:thing>
</a:root>

I am using Nokogiri to query the document:

require 'nokogiri'

doc = Nokogiri::XML(File.open('text.xml'))
things = doc.xpath('//a:thing')

This fails with the following error:

Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //a:thing

From my research, I found out that I could specify the namespace for the prefix in the xpath method:

things = doc.xpath('//a:thing', a: 'nobody knows')

This returns an empty array.

What would be the best way for me to get the nodes that I need?

Boris Bera

1 Answer

The problem is that the XML document never declares the namespace. As a result, Nokogiri treats "a:root" as the whole node name, instead of treating "a" as a namespace prefix and "root" as the node name:

require 'nokogiri'

xml = %Q{
    <?xml version="1.0" encoding="UTF-8"?>
    <a:root>
      <a:thing>stuff0</a:thing>
      <a:thing>stuff1</a:thing>
    </a:root>
}
doc = Nokogiri::XML(xml)
puts doc.at_xpath('*').node_name
#=> "a:root"
puts doc.at_xpath('*').namespace
#=> ""

Solution 1 - Specify node name with colon

One solution is to search for nodes whose name is literally "a:thing". You cannot use //a:thing, because XPath treats the "a" as a namespace prefix. You can get around this with //*[name()="a:thing"]:

require 'nokogiri'

xml = %Q{
    <?xml version="1.0" encoding="UTF-8"?>
    <a:root>
      <a:thing>stuff0</a:thing>
      <a:thing>stuff1</a:thing>
    </a:root>
}
doc = Nokogiri::XML(xml)
things = doc.xpath('//*[name()="a:thing"]')
puts things
#=> <a:thing>stuff0</a:thing>
#=> <a:thing>stuff1</a:thing>
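
If several prefixed names need to be looked up, the name() predicate from Solution 1 can be generated rather than typed by hand. The following is a minimal sketch and not part of the original answer: the helper name xpath_for_qname is hypothetical, and it assumes the node name contains no quote characters, since XPath 1.0 string literals have no escape mechanism.

```ruby
# Hypothetical helper: builds the Solution 1 XPath for a literal node name
# such as "a:thing", where "a" is an undeclared prefix.
def xpath_for_qname(qname)
  # XPath 1.0 string literals cannot escape quotes, so reject them outright.
  raise ArgumentError, "quotes are not supported: #{qname}" if qname =~ /["']/
  %(//*[name()="#{qname}"])
end

puts xpath_for_qname('a:thing')
#=> //*[name()="a:thing"]
```

The resulting string can be passed to doc.xpath in the same way as the hand-written expression above.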

Solution 2 - Modify the XML document to define the namespace

An alternative solution is to modify the XML text before parsing it so that it declares the namespace. The document then behaves with namespaces as expected:

require 'nokogiri'

xml = %Q{
    <?xml version="1.0" encoding="UTF-8"?>
    <a:root>
      <a:thing>stuff0</a:thing>
      <a:thing>stuff1</a:thing>
    </a:root>
}
xml.gsub!('<a:root>', '<a:root xmlns:a="foo">')
doc = Nokogiri::XML(xml)
things = doc.xpath('//a:thing')
puts things
#=> <a:thing>stuff0</a:thing>
#=> <a:thing>stuff1</a:thing>
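
The gsub! call above only matches a root tag written exactly as <a:root>. If the root element may carry attributes or have a different name, the declaration can be injected with a regular expression instead. This is a rough sketch under stated assumptions, not the answer's own method: the helper name declare_prefix is hypothetical, it assumes the first start tag using the prefix is the root element, and it works on raw text, so it would also match a lookalike tag inside a comment or CDATA section.

```ruby
# Hypothetical helper: adds xmlns:PREFIX="URI" to the first start tag that
# uses PREFIX, which in a document like the one above is the root element.
def declare_prefix(xml, prefix, uri)
  # Leave the text alone if the prefix is already declared somewhere.
  return xml if xml.include?("xmlns:#{prefix}=")

  # Match "<prefix:name" at the start of a tag and append the declaration
  # right after the element name, before any existing attributes.
  xml.sub(/<#{Regexp.escape(prefix)}:[^\s\/>]+/) do |tag|
    %(#{tag} xmlns:#{prefix}="#{uri}")
  end
end

xml = '<a:root id="1"><a:thing>stuff0</a:thing></a:root>'
puts declare_prefix(xml, 'a', 'foo')
#=> <a:root xmlns:a="foo" id="1"><a:thing>stuff0</a:thing></a:root>
```

The returned string can then be handed to Nokogiri::XML in place of the gsub! result in Solution 2.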
Justin Ko
  • That doesn't seem to work for me. I run `doc.xpath('//thing')` and I get an empty array back. Could it be a problem with the version? I'm running ruby 2.0.0p247 with nokogiri 1.6.0. – Boris Bera Nov 15 '13 at 16:49
  • Apparently things have changed between Nokogiri 1.5.8 and 1.6.0. The above script works in 1.5.8, but not in 1.6.0 (using Ruby 1.9.3). – Justin Ko Nov 15 '13 at 17:01
  • I updated the answer to have 2 solutions that work in Nokogiri 1.6.0. – Justin Ko Nov 15 '13 at 17:59
  • Solution 2 fits my needs the best. Is there a way to be able to modify the document through Nokogiri after it's loaded? – Boris Bera Nov 15 '13 at 18:20
  • Do you mean to add the namespace? I do not believe so. I sort of recall people mentioning that Nokogiri only registers the namespaces when the document is parsed. – Justin Ko Nov 15 '13 at 18:26
  • @JustinKo I always like your answers; they are always well arranged! – Arup Rakshit Nov 15 '13 at 19:23