75

Is there any easy/general way to clean an XML based data source prior to using it in an XmlReader so that I can gracefully consume XML data that is non-conformant to the hexadecimal character restrictions placed on XML?

Note:

  • The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding at the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.
  • The removal of invalid hexadecimal characters should only remove hexadecimal encoded values, as you can often find href values in data that happens to contains a string that would be a string match for a hexadecimal character.

Background:

I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but want to be able to consume data sources that have been published which contain invalid hexadecimal characters per the XML specification.

In .NET if you have a Stream that represents the XML data source, and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised due to the inclusion of invalid hexadecimal characters in the XML data. My current attempt to resolve this issue is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.

Grhm
  • 6,726
  • 4
  • 40
  • 64
Oppositional
  • 11,141
  • 6
  • 50
  • 63

14 Answers14

78

It may not be perfect (emphasis added since people missing this disclaimer), but what I've done in that case is below. You can adjust to use with a stream.

/// <summary>
/// Removes control characters and other non-UTF-8 characters
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>A string with no control characters or entities above 0x00FD</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;

    StringBuilder newString = new StringBuilder();
    char ch;

    for (int i = 0; i < inString.Length; i++)
    {

        ch = inString[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        //if using .NET version prior to 4, use above logic
        if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();

}
Eugene Katz
  • 5,208
  • 7
  • 40
  • 49
  • It didn't catch 0x3c , one of my application thinks that its hex , I am using ibex – Thunder Sep 21 '10 at 06:22
  • 1
    try dnewcome's solution below. – Eugene Katz Sep 21 '10 at 17:07
  • 2
    -1 this answer is misleading because it removes characters that are valid in XML, that are not control characters, and that are valid UTF-8. – Daniel Cassidy Sep 02 '11 at 15:43
  • @DanielCassidy You've posted about how these are all wrong but I don't see a post by you with the correct solution. Would you mind posting a code sample showing how we should correctly handle removing control characters? – Ryan Rinaldi Oct 21 '11 at 16:33
  • @RyanRinaldi I have a better answer partially written, I will try to finish it off and post it this weekend. In the meantime, dnewcome’s answer is the best, providing that you’re willing to accept that it doesn’t work properly if your XML contains characters from outside the Basic Multilingual Plane. Since the Basic Multilingual Plane includes every character required for every modern natural language, that’s probably not a problem for you, although there are real-world use cases that require characters outside that range (some mathematics, for instance). – Daniel Cassidy Oct 21 '11 at 16:55
  • 2
    If you want to update the answer with a better range of filters, feel free to do so. As my answer states, it may not be perfect, but it served my needs. – Eugene Katz Oct 21 '11 at 18:52
  • 3
    I used XmlConvert.IsXmlChar(ch) for my filter. – Brad J Feb 25 '15 at 16:27
  • 1
    @BradJ, very good point. The method seems to have been added in .NET 4, so switched code to just use that in the example. Thanks! – Eugene Katz Feb 25 '15 at 20:21
60

I like Eugene's whitelist concept. I needed to do a similar thing as the original poster, but I needed to support all Unicode characters, not just up to 0x00FD. The XML spec is:

Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In .NET, the internal representation of Unicode characters is only 16 bits, so we can't `allow' 0x10000-0x10FFFF explicitly. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing. However it is possible that if we allowed these surrogate code points in our whitelist, utf-8 encoding our string might produce valid XML in the end as long as proper utf-8 encoding was produced from the surrogate pairs of utf-16 characters in the .NET string. I haven't explored this though, so I went with the safer bet and didn't allow the surrogates in my whitelist.

The comments in Eugene's solution are misleading though, the problem is that the characters we are excluding are not valid in XML ... they are perfectly valid Unicode code points. We are not removing `non-utf-8 characters'. We are removing utf-8 characters that may not appear in well-formed XML documents.

public static string XmlCharacterWhitelist( string in_string ) {
    if( in_string == null ) return null;

    StringBuilder sbOutput = new StringBuilder();
    char ch;

    for( int i = 0; i < in_string.Length; i++ ) {
        ch = in_string[i];
        if( ( ch >= 0x0020 && ch <= 0xD7FF ) || 
            ( ch >= 0xE000 && ch <= 0xFFFD ) ||
            ch == 0x0009 ||
            ch == 0x000A || 
            ch == 0x000D ) {
            sbOutput.Append( ch );
        }
    }
    return sbOutput.ToString();
}
dnewcome
  • 2,045
  • 17
  • 15
  • it will append **&** and this causes `doc = XDocument.Load(@strXMLPath);` to give exception – CODError Feb 18 '14 at 09:07
  • 1
    hello, do you think XmlConvert.IsXmlChar() would be more accurate? Eugene's answer changed since your last comment. thanks – DaFi4 Apr 07 '17 at 09:53
32

As the way to remove invalid XML characters I suggest you to use XmlConvert.IsXmlChar method. It was added since .NET Framework 4 and is presented in Silverlight too. Here is the small sample:

void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

static string RemoveInvalidXmlChars(string text) {
    char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
Igor Kustov
  • 3,228
  • 2
  • 34
  • 31
13

DRY implementation of this answer's solution (using a different constructor - feel free to use the one you need in your application):

public class InvalidXmlCharacterReplacingStreamReader : StreamReader
{
    private readonly char _replacementCharacter;

    public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter) : base(fileName)
    {
        this._replacementCharacter = replacementCharacter;
    }

    public override int Peek()
    {
        int ch = base.Peek();
        if (ch != -1 && IsInvalidChar(ch))
        {
            return this._replacementCharacter;
        }
        return ch;
    }

    public override int Read()
    {
        int ch = base.Read();
        if (ch != -1 && IsInvalidChar(ch))
        {
            return this._replacementCharacter;
        }
        return ch;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int readCount = base.Read(buffer, index, count);
        for (int i = index; i < readCount + index; i++)
        {
            char ch = buffer[i];
            if (IsInvalidChar(ch))
            {
                buffer[i] = this._replacementCharacter;
            }
        }
        return readCount;
    }

    private static bool IsInvalidChar(int ch)
    {
        return (ch < 0x0020 || ch > 0xD7FF) &&
               (ch < 0xE000 || ch > 0xFFFD) &&
                ch != 0x0009 &&
                ch != 0x000A &&
                ch != 0x000D;
    }
}
Community
  • 1
  • 1
Victor Zakharov
  • 25,801
  • 18
  • 85
  • 151
  • maybe its better to use XmlConvert.IsXmlChar() over the ch range checks? what do you think? – DaFi4 Apr 07 '17 at 09:51
  • @montewhizdoh: IsXmlChar is new in .NET 4. If that's available to you, feel free to use. This solution is .NET 2.0+. – Victor Zakharov Apr 07 '17 at 11:29
  • 1
    The same approach I have implemented for myself, but by I inherited from Stream which was not such a good idea because Stream.Read() operated with the array of bytes, not chars and it wasn't as elegant to check the characters. Your solution by inheriting from StreamReader is better, thank you! – Mar Sep 07 '17 at 06:36
  • 1
    +1 Because this allows reading REALLY big XML files (successfully tested with 100MB files). Solutions that loaded everything into a String before filtering out the bad characters failed with OutOfMemory exceptions. – Brad Oestreicher Nov 02 '17 at 21:11
9

Modernising dnewcombe's answer, you could take a slightly simpler approach

public static string RemoveInvalidXmlChars(string input)
{
    var isValid = new Predicate<char>(value =>
        (value >= 0x0020 && value <= 0xD7FF) ||
        (value >= 0xE000 && value <= 0xFFFD) ||
        value == 0x0009 ||
        value == 0x000A ||
        value == 0x000D);

    return new string(Array.FindAll(input.ToCharArray(), isValid));
}

or, with Linq

public static string RemoveInvalidXmlChars(string input)
{
    return new string(input.Where(value =>
        (value >= 0x0020 && value <= 0xD7FF) ||
        (value >= 0xE000 && value <= 0xFFFD) ||
        value == 0x0009 ||
        value == 0x000A ||
        value == 0x000D).ToArray());
}

I'd be interested to know how the performance of these methods compares and how they all compare to a black list approach using Buffer.BlockCopy.

Community
  • 1
  • 1
Jodrell
  • 34,946
  • 5
  • 87
  • 124
  • I have had an issue with the Linq method throwing System.OutOfMemoryException when the XML string on larger XML files. – Brad J Feb 25 '15 at 16:11
  • @BradJ presumably, the string passed in is very long in those cases? – Jodrell Feb 25 '15 at 16:14
  • @BradJ ultimately, some sort of stream transform would be better, you could pass that directly to `XmlReader.Create` instead of loading the whole file into a string in memory. – Jodrell Feb 25 '15 at 16:41
  • 2
    just did a speed test compared to dnewcombe's answer and both of your solutions are about 3-4 times faster with the Linq version being only slightly slower than your non linq version. I was not expecting that sort of difference. used long strings and 100k iterations with stopwatch to work out timings. – Seer May 14 '15 at 03:49
  • @Seer I'm using ~60k length character streams and this solution works out to be a bit slower than the StringBuilder method, not sure what I've done differently. – adotout Jul 08 '16 at 15:36
5

Here is dnewcome's answer in a custom StreamReader. It simply wraps a real stream reader and replaces the characters as they are read.

I only implemented a few methods to save myself time. I used this in conjunction with XDocument.Load and a file stream and only the Read(char[] buffer, int index, int count) method was called, so it worked like this. You may need to implement additional methods to get this to work for your application. I used this approach because it seems more efficient than the other answers. I also only implemented one of the constructors, you could obviously implement any of the StreamReader constructors that you need, since it is just a pass through.

I chose to replace the characters rather than removing them because it greatly simplifies the solution. In this way the length of the text stays the same, so there is no need to keep track of a separate index.

public class InvalidXmlCharacterReplacingStreamReader : TextReader
{
    private StreamReader implementingStreamReader;
    private char replacementCharacter;

    public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter)
    {
        implementingStreamReader = new StreamReader(stream);
        this.replacementCharacter = replacementCharacter;
    }

    public override void Close()
    {
        implementingStreamReader.Close();
    }

    public override ObjRef CreateObjRef(Type requestedType)
    {
        return implementingStreamReader.CreateObjRef(requestedType);
    }

    public void Dispose()
    {
        implementingStreamReader.Dispose();
    }

    public override bool Equals(object obj)
    {
        return implementingStreamReader.Equals(obj);
    }

    public override int GetHashCode()
    {
        return implementingStreamReader.GetHashCode();
    }

    public override object InitializeLifetimeService()
    {
        return implementingStreamReader.InitializeLifetimeService();
    }

    public override int Peek()
    {
        int ch = implementingStreamReader.Peek();
        if (ch != -1)
        {
            if (
                (ch < 0x0020 || ch > 0xD7FF) &&
                (ch < 0xE000 || ch > 0xFFFD) &&
                ch != 0x0009 &&
                ch != 0x000A &&
                ch != 0x000D
                )
            {
                return replacementCharacter;
            }
        }
        return ch;
    }

    public override int Read()
    {
        int ch = implementingStreamReader.Read();
        if (ch != -1)
        {
            if (
                (ch < 0x0020 || ch > 0xD7FF) &&
                (ch < 0xE000 || ch > 0xFFFD) &&
                ch != 0x0009 &&
                ch != 0x000A &&
                ch != 0x000D
                )
            {
                return replacementCharacter;
            }
        }
        return ch;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int readCount = implementingStreamReader.Read(buffer, index, count);
        for (int i = index; i < readCount+index; i++)
        {
            char ch = buffer[i];
            if (
                (ch < 0x0020 || ch > 0xD7FF) &&
                (ch < 0xE000 || ch > 0xFFFD) &&
                ch != 0x0009 &&
                ch != 0x000A &&
                ch != 0x000D
                )
            {
                buffer[i] = replacementCharacter;
            }
        }
        return readCount;
    }

    public override Task<int> ReadAsync(char[] buffer, int index, int count)
    {
        throw new NotImplementedException();
    }

    public override int ReadBlock(char[] buffer, int index, int count)
    {
        throw new NotImplementedException();
    }

    public override Task<int> ReadBlockAsync(char[] buffer, int index, int count)
    {
        throw new NotImplementedException();
    }

    public override string ReadLine()
    {
        throw new NotImplementedException();
    }

    public override Task<string> ReadLineAsync()
    {
        throw new NotImplementedException();
    }

    public override string ReadToEnd()
    {
        throw new NotImplementedException();
    }

    public override Task<string> ReadToEndAsync()
    {
        throw new NotImplementedException();
    }

    public override string ToString()
    {
        return implementingStreamReader.ToString();
    }
}
Community
  • 1
  • 1
Ryan Adams
  • 51
  • 1
  • 2
4

Regex based approach

public static string StripInvalidXmlCharacters(string str)
{
    var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
    return invalidXmlCharactersRegex.Replace(str, "");

}

See my blogpost for more details

mnaoumov
  • 2,146
  • 2
  • 22
  • 31
3

I created a slightly updated version of @Neolisk's answer, which supports the *Async functions and uses the .Net 4.0 XmlConvert.IsXmlChar function.

public class InvalidXmlCharacterReplacingStreamReader : StreamReader
{
    private readonly char _replacementCharacter;

    public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter) : base(fileName)
    {
        _replacementCharacter = replacementCharacter;
    }

    public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter) : base(stream)
    {
        _replacementCharacter = replacementCharacter;
    }

    public override int Peek()
    {
        var ch = base.Peek();
        if (ch != -1 && IsInvalidChar(ch))
        {
            return _replacementCharacter;
        }
        return ch;
    }

    public override int Read()
    {
        var ch = base.Read();
        if (ch != -1 && IsInvalidChar(ch))
        {
            return _replacementCharacter;
        }
        return ch;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        var readCount = base.Read(buffer, index, count);
        ReplaceInBuffer(buffer, index, readCount);
        return readCount;
    }

    public override async Task<int> ReadAsync(char[] buffer, int index, int count)
    {
        var readCount = await base.ReadAsync(buffer, index, count).ConfigureAwait(false);
        ReplaceInBuffer(buffer, index, readCount);
        return readCount;
    }

    private void ReplaceInBuffer(char[] buffer, int index, int readCount)
    {
        for (var i = index; i < readCount + index; i++)
        {
            var ch = buffer[i];
            if (IsInvalidChar(ch))
            {
                buffer[i] = _replacementCharacter;
            }
        }
    }

    private static bool IsInvalidChar(int ch)
    {
        return IsInvalidChar((char)ch);
    }

    private static bool IsInvalidChar(char ch)
    {
        return !XmlConvert.IsXmlChar(ch);
    }
}
Georg Jung
  • 949
  • 10
  • 27
2

The above solutions seem to be for removing invalid characters prior to converting to XML.

Use this code to remove invalid XML characters from an XML string. eg. &x1A;

    public static string CleanInvalidXmlChars( string Xml, string XMLVersion )
    {
        string pattern = String.Empty;
        switch( XMLVersion )
        {
            case "1.0":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F]);";
                break;
            case "1.1":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);";
                break;
            default:
                throw new Exception( "Error: Invalid XML Version!" );
        }

        Regex regex = new Regex( pattern, RegexOptions.IgnoreCase );
        if( regex.IsMatch( Xml ) )
            Xml = regex.Replace( Xml, String.Empty );
        return Xml;
    }

http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/

Nathan G
  • 63
  • 2
  • 5
  • 1
    -1 This answer does not address the question asked, and in any case is wrong and misleading because it only removes invalid XML Character Entity References, but not invalid XML characters. – Daniel Cassidy Sep 02 '11 at 15:53
1

Modified answer or original answer by Neolisk above.
Changes: of \0 character is passed, removal is done, rather than a replacement. also, made use of XmlConvert.IsXmlChar(char) method

    /// <summary>
    /// Replaces invalid Xml characters from input file, NOTE: if replacement character is \0, then invalid Xml character is removed, instead of 1-for-1 replacement
    /// </summary>
    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter)
            : base(fileName)
        {
            _replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            int ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' == _replacementCharacter)
                    return Peek(); // peek at the next one

                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' == _replacementCharacter)
                    return Read(); // read next one

                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount= 0, ch;

            for (int i = 0; i < count && (ch = Read()) != -1; i++)
            {
                readCount++;
                buffer[index + i] = (char)ch;
            }

            return readCount;
        }


        private static bool IsInvalidChar(int ch)
        {
            return !XmlConvert.IsXmlChar((char)ch);
        }
    }
BogdanRB
  • 158
  • 1
  • 5
0

Use this function to remove invalid xml characters.

public static string CleanInvalidXmlChars(string text)   
{   
       string re = @"[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]";   
       return Regex.Replace(text, re, "");   
} 
Munavvar
  • 802
  • 1
  • 11
  • 33
-1
private static String removeNonUtf8CompliantCharacters( final String inString ) {
    if (null == inString ) return null;
    byte[] byteArr = inString.getBytes();
    for ( int i=0; i < byteArr.length; i++ ) {
        byte ch= byteArr[i]; 
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
            byteArr[i]=' ';
        }
    }
    return new String( byteArr );
}
savio
  • 1
-1

You can pass non-UTF characters with the following:

string sFinalString  = "";
string hex = "";
foreach (char ch in UTFCHAR)
{
    int tmp = ch;
   if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
    {
    sFinalString  += ch;
    }
    else
    {  
      sFinalString  += "&#" + tmp+";";
    }
}
Murari Kumar
  • 359
  • 2
  • 9
  • 1
    -1 This answer is wrong because it generates invalid XML character entity references (for example `` is not a valid XML character entity reference). Also it is misleading because it removes characters that are valid in both Unicode and XML. – Daniel Cassidy Sep 02 '11 at 15:50
  • ya thats true but above solution is for if you want to pass invalid xml in xml file, than it will work or you cant pass invalid xml character in xml document – Murari Kumar Jan 06 '12 at 13:29
  • You can’t pass invalid XML characters in an XML document no matter what you do. For instance, the character `U+0001 START OF HEADING` is not allowed in a well-formed XML document, and even if you try to escape it as ``, that’s still not allowed in a well-formed XML document. – Daniel Cassidy Jan 10 '12 at 14:38
-5

Try this for PHP!

$goodUTF8 = iconv("utf-8", "utf-8//IGNORE", $badUTF8);
Jason Plank
  • 2,336
  • 5
  • 31
  • 40