I'm developing a class for a content management system. The input content is supplied in XHTML format and can contain valid escaped characters such as &#163;. See the example below.

<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
  <head xmlns="">
    <meta name="Attr_DocumentTitle" content="Hello World Books" />
   </head>
  <body>

 <div>British Pound   &#163;</div>

 <div>Registered sign &#174;</div>

 <div>Copyright sign &#169; </div>

  </body>
</html>

My objective is to write a method that loads this into a .NET XML object, does some processing, and saves the result to a database. I want to keep the escaped characters as they are. Here is my method:

using System.IO;
using System.Text;
using System.Xml;

public static XmlDocument LoadXmlFromString(string xhtmlContent)
{
    // UTF-8 keeps any non-ASCII characters intact; Encoding.ASCII would replace them with '?'.
    byte[] xhtmlByte = Encoding.UTF8.GetBytes(xhtmlContent);
    MemoryStream mStream = new MemoryStream(xhtmlByte);
    XmlReaderSettings settings = new XmlReaderSettings();
    //Upon loading XML, prevent DTD download, which would be blocked by our 
    //firewall and generate "503 Server Unavailable" error.
    settings.XmlResolver = null;
    settings.ProhibitDtd = false;
    XmlReader reader = XmlReader.Create(mStream, settings);
    XmlDocument xmlDoc = new XmlDocument();
    // Load through the configured reader; LoadXml(xhtmlContent) would bypass the settings above.
    xmlDoc.Load(reader);
    return xmlDoc; //Value of xmlDoc.InnerXml contains £ ® © in place 
                    // of &#163; &#174; and &#169;
}

However, this method converts the escaped characters to their character equivalents. How can I avoid this and keep the escaped characters?

CleanCoder
  • Why do you want to do that? Do you want XML or text? – SLaks Dec 20 '10 at 18:40
  • To render it in a browser. With special characters it throws an error because it will not be valid XML. – CleanCoder Dec 20 '10 at 18:57
  • So I want XML, and I want the value of xmlDoc.InnerXml to have escaped characters. I don't understand why it replaces the escaped characters upon loading; it makes the XML invalid. – CleanCoder Dec 20 '10 at 19:04
  • Possible duplicate of [.NET XmlDocument LoadXML and Entities](http://stackoverflow.com/questions/152900/net-xmldocument-loadxml-and-entities). Short answer: that's by design and shouldn't bother you. What really matters is the way you output the markup to the browser. – Frédéric Hamidi Dec 20 '10 at 19:21
  • @ReggaeMan, if it is throwing errors, then you have a character encoding problem. Deal with that problem instead of trying to work around it. – Quentin Dec 20 '10 at 19:24
  • I believe it's critical to sanitize both the data coming in and the data going out. See http://markupsanitizer.codeplex.com/ for some more ideas. – Jeremy Thompson Dec 21 '10 at 00:04
  • ReggaeMan, XML and XHTML build on and support Unicode, so there is no need to escape non-ASCII characters in XHTML to have well-formed XHTML and to have it render properly in browsers. If you have problems getting characters like `£` rendered properly, then that is simply a matter of letting the browser know the encoding of the document you send, so make sure you set the charset parameter of the Content-Type HTTP header. – Martin Honnen Dec 21 '10 at 12:26
  • @Frédéric Hamidi Thanks for the link, it helped a lot, and I stopped trying to retain the escaped characters. @Martin Honnen I tried sending files with special characters like £ and it works if I save the file as UTF-8 encoded using Notepad++. Thank you. – CleanCoder Dec 21 '10 at 20:34

1 Answer

Check this related question: "why does xmltextreader convert html encoded utf8 characters to utf8 string automatically".
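If the serialized markup still has to carry character references instead of raw characters, one option is to leave the loaded document alone and control the encoding when writing it out. The following is a minimal sketch, not something taken from the linked question: the class `XhtmlEscaping` and the method `ToAsciiXml` are hypothetical names introduced here for illustration. It relies on `XmlWriter` emitting a numeric character reference for any character that the target encoding (here ASCII) cannot represent. Note that the references come out in hexadecimal form, e.g. `&#xA3;`, rather than the decimal `&#163;` of the input; both denote the same character.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class XhtmlEscaping
{
    // Serializes the document as ASCII text so that every non-ASCII
    // character is written as a numeric character reference.
    public static string ToAsciiXml(XmlDocument doc)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Encoding = Encoding.ASCII;   // characters outside ASCII get escaped
        settings.OmitXmlDeclaration = true;   // assumption: the caller wants markup only

        using (MemoryStream stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            // The writer is disposed (and flushed) before the bytes are read back.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

Calling `XhtmlEscaping.ToAsciiXml(xmlDoc)` on the document returned by `LoadXmlFromString` gives a string that is safe to store or send as plain ASCII. As the comments point out, this is usually unnecessary if the page is served with a correct charset such as UTF-8.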

Maxim Gueivandov