Is there a higher performance method for removing rare unwanted chars from a string?

Question

EDIT

Apologies if the original unedited question is misleading.

This question is not asking how to remove Invalid XML Chars from a string, answers to that question would be better directed here.

I'm not asking you to review my code.

What I'm looking for in answers is, a function with the signature

string <YourName>(string input, Func<char, bool> check);

that will have performance similar or better than RemoveCharsBufferCopyBlackList. Ideally this function would be more generic and if possible simpler to read, but these requirements are secondary.

I recently wrote a function to strip invalid XML chars from a string. In my application the strings can be modestly long and the invalid chars occur rarely. This excerise got me thinking. What ways can this be done in safe managed c# and, which would offer the best performance for my scenario.

Here is my test program, I've subtituted the "valid XML predicate" for one the omits the char 'X'.

class Program
{
    static void Main()
    {
        var attempts = new List<Func<string, Func<char, bool>, string>>
            {
                RemoveCharsLinqWhiteList,
                RemoveCharsFindAllWhiteList,
                RemoveCharsBufferCopyBlackList
            }

        const string GoodString = "1234567890abcdefgabcedefg";
        const string BadString = "1234567890abcdefgXabcedefg";
        const int Iterations = 100000;
        var timer = new StopWatch();

        var testSet = new List<string>(Iterations);
        for (var i = 0; i < Iterations; i++)
        {
            if (i % 1000 == 0)
            {
                testSet.Add(BadString);
            }
            else
            {
                testSet.Add(GoodString);
            }
        }

        foreach (var attempt in attempts)
        {
            //Check function works and JIT
            if (attempt.Invoke(BadString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            if (attempt.Invoke(GoodString, IsNotUpperX) != GoodString)
            {
                throw new ApplicationException("Broken Function");       
            }

            timer.Reset();
            timer.Start();
            foreach (var t in testSet)
            {
                attempt.Invoke(t, IsNotUpperX);
            }

            timer.Stop();
            Console.WriteLine(
                "{0} iterations of function \"{1}\" performed in {2}ms",
                Iterations,
                attempt.Method,
                timer.ElapsedMilliseconds);
            Console.WriteLine();
        }

        Console.Readkey();
    }

    private static bool IsNotUpperX(char value)
    {
        return value != 'X';
    }

    private static string RemoveCharsLinqWhiteList(string input,
                                                      Func<char, bool> check);
    {
        return new string(input.Where(check).ToArray());
    }

    private static string RemoveCharsFindAllWhiteList(string input,
                                                      Func<char, bool> check);
    {
        return new string(Array.FindAll(input.ToCharArray(), check.Invoke));
    }

    private static string RemoveCharsBufferCopyBlackList(string input,
                                                      Func<char, bool> check);
    {
        char[] inputArray = null;
        char[] outputBuffer = null;

        var blackCount = 0;
        var lastb = -1;
        var whitePos = 0;

        for (var b = 0; b , input.Length; b++)
        {
            if (!check.invoke(input[b]))
            {
                var whites = b - lastb - 1;
                if (whites > 0)
                {
                    if (outputBuffer == null)
                    {
                        outputBuffer = new char[input.Length - blackCount];
                    }

                    if (inputArray == null)
                    {
                        inputArray = input.ToCharArray();
                    }

                    Buffer.BlockCopy(
                                      inputArray,
                                      (lastb + 1) * 2,
                                      outputBuffer,
                                      whitePos * 2,
                                      whites * 2);
                    whitePos += whites; 
                }

                lastb = b;
                blackCount++;
            }
        }

        if (blackCount == 0)
        {
            return input;
        }

        var remaining = inputArray.Length - 1 - lastb;
        if (remaining > 0)
        {
            Buffer.BlockCopy(
                              inputArray,
                              (lastb + 1) * 2,
                              outputBuffer,
                              whitePos * 2,
                              remaining * 2);

        }

        return new string(outputBuffer, 0, inputArray.Length - blackCount);
    }        
}

If you run the attempts you'll note that the performance improves as the functions get more specialised. Is there a faster and more generic way to perform this operation? Or if there is no generic option is there a way that is just faster?

Please note that I am not actually interested in removing 'X' and in practice the predicate is more complicated.

Not all XML chars can be escaped (reason unknown to me). Some are just invalid, escaped or not. — usr, Apr 20 '12 at 17:43
Since your question title is "how to make my function better", but not "how to correctly remove XML character"... Note that there are more "invalid" characters - surrogate pairs must be in correct order which is not covered by your algorithm (check out XML spec). Also make sure that everyone is aware of the fact that XML will not contain data that was send in... — Alexei Levenkov, Apr 20 '12 at 18:02
@AlexeiLevenkov, you make an interesting point but, as you note, I'm not really asking about Invalid XML chars in paticular. I think it would be useful if you directed your comment to this question http://stackoverflow.com/questions/20762/how-do-you-remove-invalid-hexadecimal-characters-from-an-xml-based-data-source-p/10244306#10244306 — Jodrell, Apr 23 '12 at 08:25

score 2 · Accepted Answer · answered Apr 20 '12 at 17:37

You certainly don't want to use LINQ to Objects aka enumerators to do this if you require high performance. Also, don't invoke a delegate per char. Delegate invocations are costly compared to the actual operation you are doing.

RemoveCharsBufferCopyBlackList looks good (except for the delegate call per character).

I recommend that you inline the contents of the delegate hard-coded. Play around with different ways to write the condition. You may get better performance by first checking the current char against a range of known good chars (e.g. 0x20-0xFF) and if it matches let it through. This test will pass almost always so you can save the expensive checks against individual characters which are invalid in XML.

Edit: I just remembered I solved this problem a while ago:

    static readonly string invalidXmlChars =
        Enumerable.Range(0, 0x20)
        .Where(i => !(i == '\u000A' || i == '\u000D' || i == '\u0009'))
        .Select(i => (char)i)
        .ConcatToString()
        + "\uFFFE\uFFFF";
    public static string RemoveInvalidXmlChars(string str)
    {
        return RemoveInvalidXmlChars(str, false);
    }
    internal static string RemoveInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        if (!ContainsInvalidXmlChars(str, forceRemoveSurrogates))
            return str;

        str = str.RemoveCharset(invalidXmlChars);
        if (forceRemoveSurrogates)
        {
            for (int i = 0; i < str.Length; i++)
            {
                if (IsSurrogate(str[i]))
                {
                    str = str.Where(c => !IsSurrogate(c)).ConcatToString();
                    break;
                }
            }
        }

        return str;
    }
    static bool IsSurrogate(char c)
    {
        return c >= 0xD800 && c < 0xE000;
    }
    internal static bool ContainsInvalidXmlChars(string str)
    {
        return ContainsInvalidXmlChars(str, false);
    }
    public static bool ContainsInvalidXmlChars(string str, bool forceRemoveSurrogates)
    {
        if (str == null) throw new ArgumentNullException("str");
        for (int i = 0; i < str.Length; i++)
        {
            if (str[i] < 0x20 && !(str[i] == '\u000A' || str[i] == '\u000D' || str[i] == '\u0009'))
                return true;
            if (str[i] >= 0xD800)
            {
                if (forceRemoveSurrogates && str[i] < 0xE000)
                    return true;
                if ((str[i] == '\uFFFE' || str[i] == '\uFFFF'))
                    return true;
            }
        }
        return false;
    }

Notice, that RemoveInvalidXmlChars first invokes ContainsInvalidXmlChars to save the string allocation. Most strings do not contain invalid XML chars so we can be optimistic.

If you make `ContainsInvalidXmlChars` return `i` instead of `true` when it succeeds, you can then pass that value to `RemoveCharSet` to avoid iterating over the string twice. — Gabe, Apr 20 '12 at 17:59
I agree that delegates must carry a cost over inline code but, in my attempts I've used a delegate to highlight that I'm not interesting in the checking operation but rather the general iteration algorithm. Since your answer is specialised to the XML chars problem it is probably a better answer for this question http://stackoverflow.com/questions/20762/how-do-you-remove-invalid-hexadecimal-characters-from-an-xml-based-data-source-p/10244306#10244306. I've upvoted for the effort but you haven't provided an alternative attempt that you or I can test against my own, using the test code in the OP. — Jodrell, Apr 23 '12 at 08:34
Under the assumption that the delegate is opaque, your code is already very good. The only major improvement I can think of is to dynamically generate code at runtime using expression trees which is calling the method the delegate is pointing to directly. You can get the MethodInfo from the delegate as well as the object instance. This is kind of an insane effort but it *will* be faster. — usr, Apr 23 '12 at 10:05

Is there a higher performance method for removing rare unwanted chars from a string?

1 Answers1