
I am using the Nokogiri gem to parse HTML data.

$ gem list nokogiri

*** LOCAL GEMS ***

nokogiri (1.6.2.1)

Sample HTML is:

<html>
  <body>
    <xhtml:link>
      <div>
    Some content.
      </div>
    </xhtml:link>
  </body>
</html>

I am getting

>>  doc.xpath('/html/body/xhtml:link/div')
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: /html/body/xhtml:link/div
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:159:in `evaluate'
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:159:in `block in xpath'
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:150:in `map'
    from /var/lib/gems/1.9.1/gems/nokogiri-1.6.2.1/lib/nokogiri/xml/node.rb:150:in `xpath'
    from (irb):95
    from /usr/bin/irb:12:in `<main>'

A full sample live HTML page can be found here

How can I avoid this error?

tuxdna
  • Not the actual problem, but it seems you forgot to close the `body` tag. – Tamer Shlash Jun 18 '14 at 10:14
  • Can you not use `doc.xpath('/html/body/link/div')`? – Bala Jun 18 '14 at 11:13
  • I get the XPath by inspecting elements in Firebug. This works for other documents, but whenever an element tag contains a colon ':', it gives that error. – tuxdna Jun 18 '14 at 11:24
  • Are you parsing as HTML or XML? If you parse as HTML then Nokogiri strips off namespaces, so you can just use `link`. – matt Jun 18 '14 at 11:45

2 Answers


You need to declare the XML namespace (xhtml in your example) on your root element so that Nokogiri recognizes the prefix. If you don't, Nokogiri cannot resolve the prefix and that error appears.

You can do it this way:

<html xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <body>
        <xhtml:link>
            <div>Some content.</div>
        </xhtml:link>
    </body>
</html>

See also this and this answers.

UPDATE based on comment

I've reviewed the Nokogiri docs and found two workarounds. One is to pass the namespaces to the query:

doc.xpath('/html/body/xhtml:link/div', 'xhtml' => 'http://www.w3.org/1999/xhtml')

Another is to manually add that namespace to the root element:

doc.root.add_namespace 'xhtml', 'http://www.w3.org/1999/xhtml'
doc.xpath('/html/body/xhtml:link/div')

While both approaches silence the error, the query in both cases just returns an empty array for me, unlike what happens if the xmlns attribute was originally included in the document.

Tamer Shlash
  • I cannot modify the existing HTML because it comes from an external source. Can I provide the namespaces to Nokogiri so that it can resolve the prefix without my modifying the HTML content? – tuxdna Jun 18 '14 at 11:26

You can ignore namespaces if you are sure there are no unprefixed elements with the same name in the same context. Namespaces affect element and attribute names. If you select them using node() or *, you can test local-name() in a predicate without having to deal with namespaces.

In your example, you can select the xhtml:link element by selecting all elements in the context of body, and then restricting the result set to only those which have a local-name equal to link:

doc.xpath('/html/body/*[local-name()="link"]/div')

You might select unwanted HTML <link> elements if they occur in the body (they should never be there, but HTML parsers don't care if they are). But if they occur, they should be empty elements. There will never be one with a <div> inside, so you're safe.

helderdarocha
  • You can easily remove namespaces by using [`doc.remove_namespaces!`](http://nokogiri.org/Nokogiri/XML/Document.html#method-i-remove_namespaces-21). – Phrogz Jun 19 '14 at 10:14