
I am reading some data from an XML webservice with Ruby, something like this:

<phrases>
  <phrase language="en_US">&iexcl;I&#39;m highly&nbsp;annoyed with character references!</phrase>
</phrases>

I'm parsing the XML and grabbing an array of phrases. As you can see, the phrase text contains some character and entity references. I'd like to replace each one with the actual character it refers to. This is simple enough for the numeric references, but nasty for the named XML and HTML ones. I'd like to avoid keeping a big hash in my code that maps every XML or HTML entity name to its character, e.g. http://www.java2s.com/Code/Java/XML/Resolvesanentityreferenceorcharacterreferencetoitsvalue.htm
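
For the numeric references, a minimal sketch in plain Ruby might look like the following. The helper name `decode_numeric_refs` is hypothetical, and the sketch assumes every reference names a valid Unicode code point (an out-of-range value would raise). Named references such as `&iexcl;` and `&nbsp;` are deliberately left alone, which is exactly the part that needs a lookup table or a library.

```ruby
# Hypothetical helper: resolve decimal (&#39;) and hexadecimal (&#xA9;)
# numeric character references. Named references are left untouched.
def decode_numeric_refs(text)
  text.gsub(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/) do
    # $1 holds the hex digits, $2 the decimal digits; only one is set.
    code_point = $1 ? $1.hex : $2.to_i
    # pack('U') turns the code point into a UTF-8 encoded string.
    [code_point].pack('U')
  end
end

puts decode_numeric_refs('&iexcl;I&#39;m highly&nbsp;annoyed &#xA9;')
# prints: &iexcl;I'm highly&nbsp;annoyed ©
```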

Surely there's a library for this out there, right?

Update

Yes, there is a library out there, and it's called HTMLEntities:

: jmglov@laurana; sudo gem install htmlentities
Successfully installed htmlentities-4.2.4
: jmglov@laurana;  irb
irb(main):001:0> require 'htmlentities'
=> []
irb(main):002:0> HTMLEntities.new.decode "&iexcl;I&#39;m highly&nbsp;annoyed with character references!"
=> "¡I'm highly annoyed with character references!"
Josh Glover
  • What about `&gt;` and `&lt;`? If you replace _all_ entities, you may break well-formed XML. – Matt Ball Mar 10 '11 at 16:02
  • Matt, the data I'm dealing with has already been XML parsed; I'm dealing with CDATA here, so I want all the entities resolved. I'll update the question to make this clear. – Josh Glover Mar 10 '11 at 16:05
  • So you have an XML document where the data itself contains entities? (i.e. you have an XML representation of the ASCII string `&nbsp;`, and not an XML representation of a non-breaking space?). – Quentin Mar 10 '11 at 16:11
  • It should be pointed out that `&nbsp;` isn't a native XML entity. It appears in XHTML because the DTDs for XHTML define it. – Quentin Mar 10 '11 at 16:12
  • mu, thanks for the link to the [How to encode/decode html entities in ruby](http://stackoverflow.com/questions/1600526/how-to-encode-decode-html-entities-in-ruby) question. The solution therein gets me most of the way; I'll still need to deal with the annoying `&iexcl;` and `&nbsp;` class of entities, but that should be doable. – Josh Glover Mar 11 '11 at 09:27
  • The result of my XML parsing (which I'm not personally doing--I'm working on a part of the system that is simply handed arrays of strings) is a string like "&iexcl;I&#39;m highly&nbsp;annoyed with character references!". – Josh Glover Mar 11 '11 at 09:30

2 Answers


REXML can do it, though it won't handle "&iexcl;" or "&nbsp;". The list of predefined XML entities (aside from numeric character references) is actually quite small: just &lt;, &gt;, &amp;, &apos; and &quot;. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

Given this input XML:

<phrases>
  <phrase language="en_US">&quot;I&#39;m highly annoyed with character references!&#x00a9;</phrase>
</phrases>

you can parse the XML and the embedded entities like this (for example):

require 'rexml/document'

# Read the whole file into a string and parse it.
doc = REXML::Document.new(File.read('/tmp/foo.xml'))
phrase = REXML::XPath.first(doc, '//phrases/phrase')
text = phrase.first # First child node; type is REXML::Text
puts(text.value)    # value returns the text with references resolved

Obviously, that example assumes that the XML is in file /tmp/foo.xml. You can just as easily pass a string of XML. On my Mac and Ubuntu systems, running it produces:

$ ruby /tmp/foo.rb
"I'm highly annoyed with character references!©
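
Passing a string instead of a file could be sketched as below. The variable names `xml` and `phrases` are only illustrative, and the sketch assumes the rexml library is available (it ships as a bundled gem with recent Ruby versions). It collects the text of every phrase into an array, since that is what the question asks for.

```ruby
require 'rexml/document'

# The same document as above, held in a string rather than a file.
xml = <<~XML
  <phrases>
    <phrase language="en_US">&quot;I&#39;m highly annoyed with character references!&#x00a9;</phrase>
  </phrases>
XML

doc = REXML::Document.new(xml)

# Element#text returns the first text child with predefined entities
# and numeric character references already resolved.
phrases = REXML::XPath.match(doc, '//phrases/phrase').map { |node| node.text }
puts phrases
```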
Brian Clapper

This isn't an attempt to provide a solution; it's to relate some of my own experiences dealing with XML from the wild. I was using Perl at first, then later Ruby, and these are experiences you can easily run into if you grab enough XML or RDF/RSS/Atom feeds.

I've often seen XML CDATA contain HTML, both encoded and unencoded. The encoded HTML was probably the result of someone doing things the right way, via some API or library to generate XML. The unencoded HTML was probably someone using a script to wrap the HTML with tags, resulting in invalid XML, but I had to deal with it anyway.

I've also seen XML CDATA containing HTML that had been encoded multiple times, requiring me to unencode everything again, even after the XML engine had done its thing. Sometimes during an intermediate pass I'd suddenly have non-UTF-8 characters in the string along with encoded ones, as a result of someone appending comments or joining multiple HTML streams that came from different character sets. Whatever the reason, it was really ugly and caused XML parsing to break or emit a lot of warnings. I'd have to loop over the content, decoding and comparing the result of each pass with the previous one, and bailing once nothing had changed. There was no guarantee I'd have a string in a valid character set at that point though, so I'd have to tell iconv to convert it to UTF-8 and throw away characters that wouldn't convert cleanly.
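
A rough sketch of that decode-until-stable loop, under some loud assumptions: the helper name `decode_until_stable` and its `max_passes` cap are hypothetical, `CGI.unescapeHTML` from Ruby's standard library stands in for whatever decoder is actually used (it only knows the five predefined entities plus numeric references), and `String#scrub` is swapped in for iconv to throw away bytes that are not valid in the string's encoding.

```ruby
require 'cgi'

# Hypothetical helper: decode repeatedly until a pass changes nothing.
# max_passes is a safety cap so odd input cannot keep the loop spinning.
def decode_until_stable(text, max_passes: 5)
  # Drop bytes that are invalid in the string's encoding (stand-in for iconv).
  current = text.scrub('')
  max_passes.times do
    decoded = CGI.unescapeHTML(current)
    break if decoded == current # nothing changed, so we are done
    current = decoded
  end
  current
end

puts decode_until_stable('&amp;amp;lt;b&amp;amp;gt;')
# prints: <b>
```

The cap is a design choice rather than a requirement: each pass can only shrink the string, so the loop would stop on its own, but a small limit makes the worst case obvious.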

Nokogiri can decode the content of a node various ways, by creative use of the to_xml and to_html methods. You can also look at the HTMLEntities gem, Loofah, and others to go after the CDATA contents. Loofah is nice because it's designed to whitelist/blacklist tags you might encounter.

The XML spec is supposed to protect us from such shenanigans, but, as one of my co-workers used to tell me, "We can make it fool-proof, but not damn-fool-proof". People are SO inventive and the specs mean nothing to someone who didn't bother to read them or doesn't care.

the Tin Man
  • Thanks so much for the pointer to [HTMLEntities](http://htmlentities.rubyforge.org/)! I submitted an edit to your answer to demonstrate how I solved my problem, so it should show up when it is reviewed. – Josh Glover Mar 11 '11 at 09:44
  • @Josh Glover, You should not edit my answer. You should edit your original question to show how you used the answer. – the Tin Man Mar 11 '11 at 14:53