
I want Nokogiri to leave HTML entities untouched, but it seems to be converting the entities into the actual symbol. For example:

 Nokogiri::HTML.fragment('<p>&reg;</p>').to_s

results in: "<p>®</p>"

Nothing seems to return the original HTML back to me. The .inner_html, .text, and .content methods all return '®' instead of '&reg;'.

Is there a way for Nokogiri to leave these HTML entities untouched?

I've already searched Stack Overflow and found similar questions, but nothing exactly like this one.

the Tin Man
Richard
  • Have you seen [this question](http://stackoverflow.com/questions/2567029/how-to-make-nokogiri-transparently-return-un-encoded-html-entities-untouched)? – rdvdijk Oct 13 '11 at 15:01
  • That question only deals with leaving the UTF-8 as-is, not avoiding the decoding of entities. – tadman Oct 13 '11 at 15:08
  • rdvdijk - yes I've seen that question, but it isn't what I'm asking about. The author is getting the correct output in his first line of code, but I am not. – Richard Oct 13 '11 at 15:23
  • I'd vote to close this as a duplicate of http://stackoverflow.com/questions/4476047/how-to-make-nokogiri-not-to-convert-nbsp-to-space, except the accepted answer to that question is quite a bit of a hack instead of a clean "don't convert". – Phrogz Oct 13 '11 at 19:31
  • Using `to_html :encoding => 'US-ASCII'` instead of `to_s` outputs `<p>&#174;</p>`, which may be of use if your problem is trying to avoid encoding issues. It doesn't look like there's a way to make Nokogiri output named character entities from what I can tell though.
    – matt Oct 13 '11 at 19:52
  • Phrogz - cool, i guess i'm terrible at searching. it is a duplicate of [http://stackoverflow.com/questions/4476047/how-to-make-nokogiri-not-to-convert-nbsp-to-space] – Richard Oct 13 '11 at 19:58

1 Answer


Not an ideal answer, but you can force it to generate entities (if not nice names) by setting the allowed encoding:

#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>&reg;</p>')
puts html.to_html                              #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' )       #=> <p>&#174;</p>

It would be nice if Nokogiri used 'nice' names of entities where defined, instead of always using the terse numeric character reference, but even that wouldn't be 'preserving' the original.
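If named entities in the output matter more than preserving the exact input, one option is to post-process the US-ASCII output yourself. The following is only a sketch using plain Ruby string handling, not a Nokogiri feature: the `name_entities` helper and the small `NAMED_ENTITIES` table are hypothetical names invented for illustration, and the table covers just a few code points.

```ruby
# Hypothetical lookup table (not part of Nokogiri): code point => entity name.
# Only a handful of entries are listed; extend it as needed.
NAMED_ENTITIES = {
  160  => 'nbsp',
  169  => 'copy',
  174  => 'reg',
  8482 => 'trade'
}.freeze

# Hypothetical helper: rewrite decimal (&#174;) and hexadecimal (&#xAE;)
# character references as named entities when the table knows the code point.
# References that are not in the table are left exactly as they were.
def name_entities(html)
  html.gsub(/&#(?:x(\h+)|(\d+));/i) do |reference|
    code = $1 ? $1.hex : $2.to_i
    name = NAMED_ENTITIES[code]
    name ? '&' + name + ';' : reference
  end
end

puts name_entities('<p>&#174;</p>')   # decimal reference
puts name_entities('<p>&#xAE;</p>')   # hexadecimal reference
```

This still does not preserve the original markup: every occurrence of a known code point is rewritten to its named form, whichever form the source document used.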

The root of the problem is that, in HTML, the following all describe the exact same content:

<p>®</p>
<p>&reg;</p>
<p>&#xAE;</p>  
<p>&#174;</p>

If you wanted the to_s representation of a text node to be actually &reg; then the markup describing that would really be: <p>&amp;reg;</p>.

If Nokogiri were to always return the same encoding per character as was used to enter the document, it would need to store each character as a custom node recording the entity reference. There exists a class that might be used for this (Nokogiri::XML::EntityReference):

require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo&reg;</p>

However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT during parsing does not appear to cause one to be created:

require 'nokogiri'
html = "<p>Foo&reg;</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
  Nokogiri::XML::ParseOptions::DEFAULT_HTML,
  Nokogiri::XML::ParseOptions::DEFAULT_XML,
  Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
  p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">
Phrogz
  • The output from the `to_s` and `to_(x)html` methods also depends on your default encoding. If you've still got your test files around try adding `# encoding: UTF-8` to the top and rerunning them. I get `<p>®</p>` with `to_xhtml` by itself. The `to_(x)html` methods allow you to explicitly set the encoding you want, and it looks like Nokogiri is smart enough to escape any characters that can't be represented in the output encoding.
    – matt Oct 13 '11 at 20:43
  • The line `puts html.to_html( encoding:'US-ASCII' )` saved my day – lumpidu Feb 14 '12 at 09:30
  • FYI this doesn't work quite the same in JRuby, since it's using a library different than libxml. – Edward Anderson May 06 '13 at 21:56