Using scala.xml.parsing.XhtmlParser, I can parse an XHTML document without either losing the entity references or having to resolve them against the DTD. However, XhtmlParser appears to do this by internally resolving the entities, so that, for instance, &mdash; becomes a literal —, &ldquo; becomes a literal “, and so on.
This is clearly the right thing to do if you want to extract Unicode text from an XHTML document. However, once I've imported the XHTML and munged it in various ways, I need to output it again, and I don't trust the downstream system to handle encodings correctly. I'd like to output my results in an ASCII-safe manner, turning the —s back into &mdash;es and so on.
I've tried using scala.xml.Xhtml.toXhtml() on my Elem objects, but it just produces (sensibly enough) a Unicode String, with the only things encoded being &, <, and >, as required by XML.
I suppose I could take scala.xml.parsing.XhtmlEntities.entList, go through my output string character by character, and make the substitutions myself, but this seems like a chore. (Plus, I wouldn't be able to use the raw list, as I'd have to skip the legit <s, >s, and &s in the XML output.)
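For reference, here is a rough sketch of the manual pass I have in mind. It is only an illustration: AsciiSafe, sampleNames, and escape are names I made up, and the three-entry table is a stand-in for a full name table (which I assume could be built by inverting entList, taking it to be a list of name/code-point pairs). It works on the already-serialized string and only rewrites code points above 127, so it never touches the <, >, and & that belong to the markup. Anything without a name falls back to a decimal numeric character reference.

```scala
object AsciiSafe {
  // Hypothetical subset of the entity-name table, keyed by code point.
  // Assumption: the full table could be derived from
  // scala.xml.parsing.XhtmlEntities.entList by swapping each (name, code) pair.
  val sampleNames: Map[Int, String] = Map(
    0x2014 -> "mdash",
    0x201C -> "ldquo",
    0x201D -> "rdquo"
  )

  /** Rewrites every non-ASCII code point in already-serialized XHTML
    * as a named entity if one is known, else as a numeric reference.
    * ASCII characters, including markup, pass through unchanged.
    */
  def escape(serialized: String, names: Map[Int, String] = sampleNames): String = {
    val out = new StringBuilder
    var i = 0
    while (i < serialized.length) {
      // Read a full code point so surrogate pairs are handled as one unit.
      val cp = serialized.codePointAt(i)
      if (cp < 128) {
        out.append(cp.toChar)
      } else {
        names.get(cp) match {
          case Some(name) => out.append('&').append(name).append(';')
          case None       => out.append("&#").append(cp.toString).append(';')
        }
      }
      i += Character.charCount(cp)
    }
    out.toString
  }
}
```

With the sample table, AsciiSafe.escape("<p>a — b &amp; “c”</p>") gives <p>a &mdash; b &amp; &ldquo;c&rdquo;</p>. The named forms only help if the downstream system knows the XHTML entity set; passing an empty map would emit numeric references throughout, which need no DTD.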
Is there anything in the Scala XML libraries that will do this for me, or is the manual scan/replace my best option?