Not an ideal answer, but you can force it to generate entities (if not nice names) by setting the allowed encoding:
#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>®</p>')
puts html.to_html #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' ) #=> <p>®</p>
It would be nice if Nokogiri used 'nice' names of entities where defined, instead of always using the terse hexadecimal entity, but even that wouldn't be 'preserving' the original.
The root of the problem is that, in HTML, the following all describe the exact same content:
<p>®</p>
<p>®</p>
<p>®</p>
<p>®</p>
If you wanted the to_s
representation of a text node to be actually ®
then the markup describing that would really be: <p>&reg;</p>
.
If Nokogiri was to always return the same encoding per character as was used to enter the document it would need to store each character as a custom node recording the entity reference. There exists a class that might be used for this (Nokogiri::XML::EntityReference
):
require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo®</p>
However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT
during parsing does not appear to cause one to be created:
require 'nokogiri'
html = "<p>Foo®</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
Nokogiri::XML::ParseOptions::DEFAULT_HTML,
Nokogiri::XML::ParseOptions::DEFAULT_XML,
Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">
®
`, which may be of use if your problem is trying to avoid encoding issues. It doesn't look like there's a way to make Nokogiri output named character entities from what I can tell though. – matt Oct 13 '11 at 19:52