EDIT
Apologies if the original unedited question is misleading.
This question is not asking how to remove Invalid XML Chars from a string
, answers to that question would be better directed here.
I'm not asking you to review my code.
What I'm looking for in answers is, a function with the signature
string <YourName>(string input, Func<char, bool> check);
that will have performance similar or better than RemoveCharsBufferCopyBlackList
. Ideally this function would be more generic and if possible simpler to read, but these requirements are secondary.
I recently wrote a function to strip invalid XML chars from a string. In my application the strings can be modestly long and the invalid chars occur rarely. This excerise got me thinking. What ways can this be done in safe managed c# and, which would offer the best performance for my scenario.
Here is my test program, I've subtituted the "valid XML predicate" for one the omits the char 'X'
.
class Program
{
static void Main()
{
var attempts = new List<Func<string, Func<char, bool>, string>>
{
RemoveCharsLinqWhiteList,
RemoveCharsFindAllWhiteList,
RemoveCharsBufferCopyBlackList
}
const string GoodString = "1234567890abcdefgabcedefg";
const string BadString = "1234567890abcdefgXabcedefg";
const int Iterations = 100000;
var timer = new StopWatch();
var testSet = new List<string>(Iterations);
for (var i = 0; i < Iterations; i++)
{
if (i % 1000 == 0)
{
testSet.Add(BadString);
}
else
{
testSet.Add(GoodString);
}
}
foreach (var attempt in attempts)
{
//Check function works and JIT
if (attempt.Invoke(BadString, IsNotUpperX) != GoodString)
{
throw new ApplicationException("Broken Function");
}
if (attempt.Invoke(GoodString, IsNotUpperX) != GoodString)
{
throw new ApplicationException("Broken Function");
}
timer.Reset();
timer.Start();
foreach (var t in testSet)
{
attempt.Invoke(t, IsNotUpperX);
}
timer.Stop();
Console.WriteLine(
"{0} iterations of function \"{1}\" performed in {2}ms",
Iterations,
attempt.Method,
timer.ElapsedMilliseconds);
Console.WriteLine();
}
Console.Readkey();
}
private static bool IsNotUpperX(char value)
{
return value != 'X';
}
private static string RemoveCharsLinqWhiteList(string input,
Func<char, bool> check);
{
return new string(input.Where(check).ToArray());
}
private static string RemoveCharsFindAllWhiteList(string input,
Func<char, bool> check);
{
return new string(Array.FindAll(input.ToCharArray(), check.Invoke));
}
private static string RemoveCharsBufferCopyBlackList(string input,
Func<char, bool> check);
{
char[] inputArray = null;
char[] outputBuffer = null;
var blackCount = 0;
var lastb = -1;
var whitePos = 0;
for (var b = 0; b , input.Length; b++)
{
if (!check.invoke(input[b]))
{
var whites = b - lastb - 1;
if (whites > 0)
{
if (outputBuffer == null)
{
outputBuffer = new char[input.Length - blackCount];
}
if (inputArray == null)
{
inputArray = input.ToCharArray();
}
Buffer.BlockCopy(
inputArray,
(lastb + 1) * 2,
outputBuffer,
whitePos * 2,
whites * 2);
whitePos += whites;
}
lastb = b;
blackCount++;
}
}
if (blackCount == 0)
{
return input;
}
var remaining = inputArray.Length - 1 - lastb;
if (remaining > 0)
{
Buffer.BlockCopy(
inputArray,
(lastb + 1) * 2,
outputBuffer,
whitePos * 2,
remaining * 2);
}
return new string(outputBuffer, 0, inputArray.Length - blackCount);
}
}
If you run the attempts you'll note that the performance improves as the functions get more specialised. Is there a faster and more generic way to perform this operation? Or if there is no generic option is there a way that is just faster?
Please note that I am not actually interested in removing 'X' and in practice the predicate is more complicated.