In my C# app, XML data may contain arbitrary element text that's already been pre-processed, so that (among other things) illegal characters have been converted to their escaped (xml character entity encoded) form.

Example: <myElement>this & that</myElement> has been converted to <myElement>this &amp; that</myElement>.

The problem is that when I use XmlTextWriter to save the file, the '&' is getting re-escaped into <myElement>this &amp;amp; that</myElement>. I don't want that extra &amp; in the string.

Another example: my processing changes <myElement>• bullet</myElement> to <myElement>&#8226; bullet</myElement>, which gets saved as <myElement>&amp;#8226; bullet</myElement>. All I want output to the file is the <myElement>&#8226; bullet</myElement> form.

I've tried various options on the various XmlWriters, etc., but can't seem to get the raw strings output correctly. And why can't the XML parser recognize escapes that are already valid and leave them alone?
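For illustration, the double escaping can be reproduced in a few lines. This is a minimal sketch, not code from the original post: the class `EscapeDemo` and its method names are hypothetical. It assumes the pre-escaped string reaches the DOM through `InnerText`, which treats the string as literal character data, so the `&` in `&amp;` is itself escaped on output. Assigning the same string through `InnerXml` has it parsed as markup instead, so the entity is resolved once and written once.

```csharp
using System.Xml;

// Hypothetical demo class; not part of the original question.
public static class EscapeDemo
{
    // Stores the string as literal character data. An '&' that is already
    // part of "&amp;" is escaped again when the node is serialized.
    public static string ViaInnerText(string escaped)
    {
        var doc = new XmlDocument();
        XmlElement el = doc.CreateElement("myElement");
        el.InnerText = escaped;
        doc.AppendChild(el);
        return doc.OuterXml;
    }

    // Parses the string as XML content, so "&amp;" becomes a single '&'
    // in the tree and is escaped exactly once on output.
    public static string ViaInnerXml(string escaped)
    {
        var doc = new XmlDocument();
        XmlElement el = doc.CreateElement("myElement");
        el.InnerXml = escaped;
        doc.AppendChild(el);
        return doc.OuterXml;
    }
}
```

With the input `this &amp; that`, `ViaInnerText` yields the doubled `&amp;amp;` form and `ViaInnerXml` yields the single `&amp;` form. Note that `InnerXml` resolves numeric references such as `&#8226;` into the character itself, so whether the reference reappears in the saved file depends on the writer and its encoding.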

Update: after more debugging, I found that element text strings (actually all strings, including element tags, names, attributes, etc.) get encoded whenever they are copied into the .NET XML object data (CDATA being an exception) by an internal class called XmlCharType under System.Xml. So the problem has nothing to do with the XmlWriters. It looks like the best way to solve the problem is to un-escape the data when it's output, using something like:

string output = System.Net.WebUtility.HtmlDecode(xmlDoc.OuterXml);

Which will probably evolve into a custom XmlWriter in order to preserve formatting, etc.
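One caveat with decoding the whole serialized document is worth sketching. `WebUtility.HtmlDecode` cannot tell the unwanted extra escape apart from escapes the XML actually needs, so it also turns `&lt;` and `&amp;` back into bare `<` and `&`. The wrapper class `DecodeDemo` below is hypothetical and only makes the behavior easy to try.

```csharp
using System.Net;

// Hypothetical demo class; not part of the original question.
public static class DecodeDemo
{
    // Decodes every entity in the serialized document in a single pass,
    // including the ones that keep the markup well-formed.
    public static string DecodeWholeDocument(string xml)
    {
        return WebUtility.HtmlDecode(xml);
    }
}
```

For `<myElement>&amp;#8226; bullet</myElement>` this gives the desired `&#8226;` form, but for `<myElement>1 &lt; 2 &amp; 3</myElement>` it gives text with a bare `<` and `&`, which is no longer well-formed XML.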

Thanks all for the helpful suggestions.

Dave G
  • Can you post a snippet of how you're using `XmlTextWriter`? If your C# has already created an XML string, why are you using `XmlTextWriter`? – Jacob Feb 14 '12 at 23:44
  • The code involved is actually kind of extensive. I'm using XmlTextWriter just to serialize the XML to a file. If need be, I can create a sample app that reproduces the behavior, but the issue has to be well-known. Apologies in advance if this has been answered, but I can't seem to find anything relevant other than dropping down to WriteRaw, which seems like a hack. – Dave G Feb 14 '12 at 23:53
  • Just a thought, but shouldn't a `CDATA` block allow this? – M.Babcock Feb 15 '12 at 00:10

2 Answers

Ok, here's the solution I came up with:

using System;
using System.IO;
using System.Text;

namespace YourName {

    // Represents a writer that makes it possible to pre-process 
    // XML character entity escapes without them being rewritten.
    class XmlRawTextWriter : System.Xml.XmlTextWriter {
        public XmlRawTextWriter(Stream w, Encoding encoding)
            : base(w, encoding) {
        }

        public XmlRawTextWriter(String filename, Encoding encoding)
            : base(filename, encoding) {
        }

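        // Emit text verbatim: WriteRaw skips entity escaping, so the text
        // must already be valid, escaped XML content.
        public override void WriteString(string text) {
            base.WriteRaw(text);
        }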
    }
}

Then use it as you would an XmlTextWriter:

        // Disposing the writer flushes its buffer and closes the file.
        using (XmlRawTextWriter rawWriter = new XmlRawTextWriter(thisFilespec, Encoding.UTF8)) {
            rawWriter.Formatting = Formatting.Indented;
            rawWriter.Indentation = 1;
            rawWriter.IndentChar = '\t';
            xmlDoc.Save(rawWriter);
        }

This works without having to un-encode or hack around the encoding functionality.

Dave G

Try calling XmlWriter.WriteRaw instead. It does not check whether the characters are valid, though, so you have to check them yourself; otherwise invalid XML will be generated.
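One possible way to do that check, as a rough sketch rather than a definitive implementation: run the pre-escaped text through an `XmlReader` in fragment mode before handing it to `WriteRaw`, and treat any `XmlException` as "not safe to write raw". The class `RawTextCheck` and the method `IsWellFormedContent` are hypothetical names introduced here.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper; not part of the original answer.
public static class RawTextCheck
{
    // Returns true when 'text' parses as well-formed XML content, which
    // means every '&' starts a valid reference and no bare '<' appears.
    public static bool IsWellFormedContent(string text)
    {
        var settings = new XmlReaderSettings {
            // Fragment mode accepts text without a single root element.
            ConformanceLevel = ConformanceLevel.Fragment
        };
        try {
            using (XmlReader reader = XmlReader.Create(new StringReader(text), settings)) {
                while (reader.Read()) { }   // consume the whole input
            }
            return true;
        }
        catch (XmlException) {
            return false;
        }
    }
}
```

This accepts `this &amp; that` and `&#8226; bullet` and rejects `this & that`. It also accepts complete child elements inside the text, so it verifies well-formedness only, not that the string is plain escaped text.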

findcaiyzh
  • Yeah, that's a thought - but while testing your suggestion I realized the element text is actually encoded in the XML tree. I had thought it was getting encoded on output but it's apparently on input or access. – Dave G Feb 15 '12 at 00:50