Given: a base64 string

When: Foo is called

Then: Foo returns the input string with the following character replacements ('A' => '_', 'B' => '-', 'C' => '+') and does it as fast as possible

I compared several algorithms to determine which version of Foo is faster. The results point towards plain-old string.Replace, which is quite surprising. I would have expected the Regex to take an initial hit for compiling, but then blaze through and outperform string.Replace which creates three copies of the string per invocation of Foo.

I'd like to check if anybody else can confirm these findings or come up with an explanation of why the winner outperformed the rest.

I ran Foo 100k times with each of these algorithms. The results below are the elapsed TimeSpan for each, measured with Stopwatch on a debug build after execution finished:

00:00:00.0500790 <=== string.Replace [1]
00:00:00.0699696 <=== StringBuilder.Append [2]
00:00:00.0988960 <=== StringBuilder.Replace [3]
00:00:00.7440135 <=== Regex [4]
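
The harness itself is not shown above, so here is a minimal sketch of how such a measurement might be set up. The class name, the Measure helper, and the sample input are made up for illustration; only the Stopwatch usage and the 100k iteration count come from the description. A release build run outside the debugger would likely give more representative numbers than a debug build.

```csharp
using System;
using System.Diagnostics;

public static class ReplaceBenchmark
{
    // Variant [1] from the question: three chained string.Replace calls.
    public static string Foo(string input)
    {
        return input.Replace("A", "_").Replace("B", "-").Replace("C", "+");
    }

    // Hypothetical helper: runs the candidate `iterations` times and
    // returns the elapsed wall-clock time.
    public static TimeSpan Measure(Func<string, string> candidate, string input, int iterations)
    {
        candidate(input); // warm-up call so JIT compilation is not part of the timing

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            candidate(input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        // Assumed sample input; the question does not show the string that was used.
        string input = "QUJDREVGR0hJSktMTU5PUFFSU1RVVldYWVo=";
        Console.WriteLine("{0} <=== string.Replace", Measure(Foo, input, 100000));
    }
}
```

The other variants can be passed to Measure the same way, so every candidate is timed under identical conditions.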

[1]:

static string Foo(string input)
{
    // Each Replace call allocates a new string, so this creates up to three copies.
    return input.Replace("A", "_").Replace("B", "-").Replace("C", "+");
}

[2]:

static string Foo(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (var x in input)
    {
        if (x == 'A')
        {
            sb.Append('_');
        }
        else if (x == 'B')
        {
            sb.Append('-');
        }
        else if (x == 'C')
        {
            sb.Append('+');
        }
        else
        {
            sb.Append(x);
        }
    }
    return sb.ToString();
}

[3]:

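static string Foo(string input)
{
    // StringBuilder.Replace works on the builder's own buffer; only ToString() creates a new string.
    return new StringBuilder(input, input.Length).Replace("A", "_").Replace("B", "-").Replace("C", "+").ToString();
}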

[4]:

static readonly Regex charsRegex = new Regex(@"[ABC]", RegexOptions.Compiled);

static string Foo(string input)
{
    // The delegate is invoked once per match and returns the replacement text.
    return charsRegex.Replace(input, delegate (Match m)
    {
        var value = m.Value;
        if (value == "A")
        {
            return "_";
        }
        else if (value == "B")
        {
            return "-";
        }
        else if (value == "C")
        {
            return "+";
        }

        return value;
    });
}
mo5470
  • Note that your Regex code specified IgnoreCase, while the other solutions don't. Have you tried the regex without it? Or alternate, tried the others case-insensitively (which might require x2 the calls) – Avner Shahar-Kashtan Jun 25 '16 at 19:24
  • Yeah, that's actually not supposed to be there. Edited – mo5470 Jun 25 '16 at 19:25
  • So the timings are for code *without* the IgnoreCase? Could you verify that the other snippets match the ones that you timed? It might send people guessing in the wrong direction. – Avner Shahar-Kashtan Jun 25 '16 at 19:26
  • Also, how did you time? Were you using `StopWatch`? – Avner Shahar-Kashtan Jun 25 '16 at 19:28
  • Also, is this compiled with, or without debugging information and optimization? I would expect the chained replace to be optimized by the compiler to do it in place since there are no intermediate variables. So while you get a new string returned, it doesn't necessarily create a string, replace all of one, return new string, replace all of another... but only an ILDump could tell – Eris Jun 25 '16 at 19:39
  • StringBuilder will be faster than string.Replace if you replace more characters. Also, the new Roslyn compiler in Visual Studio 2015 can optimize the executable much better than the JIT compilation. – Slai Jun 25 '16 at 20:56

4 Answers

I would like to suggest two other implementations.

public /*unsafe*/ static string Foo(string text)
{
    char[] a = text.ToCharArray();
    for(int i = 0; i < a.Length; i++)
        switch(a[i])
        {
        case 'A': a[i] = '_'; break;
        case 'B': a[i] = '-'; break;
        case 'C': a[i] = '+'; break;
        }
    return new string(a);
}

OR

public /*unsafe*/ static string Foo(string text)
{
    char[] a = new char[text.Length];
    for(int i = 0; i < text.Length; i++)
    {
        char c=text[i];
        switch(c)
        {
        case 'A': a[i] = '_'; break;
        case 'B': a[i] = '-'; break;
        case 'C': a[i] = '+'; break;
        default: a[i] = c; break;
        }
    }
    return new string(a);
}

If you allow unsafe code and uncomment unsafe, this could be even faster than [1].
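
As posted, neither version contains any pointer code, so uncommenting unsafe on its own presumably changes nothing. A minimal sketch of what a pointer-based variant of the first implementation might look like follows. The class and method names are made up for illustration, the code must be compiled with /unsafe (AllowUnsafeBlocks), and whether it actually beats [1] would have to be measured.

```csharp
using System;

public static class UnsafeReplace
{
    // Hypothetical pointer-based variant of the first implementation above.
    // Requires compiling with /unsafe (AllowUnsafeBlocks).
    public static unsafe string FooUnsafe(string text)
    {
        char[] a = text.ToCharArray();
        fixed (char* start = a) // pins the array; start is null for an empty array
        {
            char* end = start + a.Length;
            for (char* p = start; p < end; p++)
            {
                switch (*p)
                {
                    case 'A': *p = '_'; break;
                    case 'B': *p = '-'; break;
                    case 'C': *p = '+'; break;
                }
            }
        }
        return new string(a);
    }

    public static void Main()
    {
        Console.WriteLine(FooUnsafe("QUFCQkND")); // prints "QUF+QkND"
    }
}
```

Walking the array through a pointer avoids the per-element bounds checks of the indexed loop, which is the saving the unsafe version is after.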

[1] wins because it's all native, even though it loops through the data 3 times. [2] has many index checks and index increments. [3] makes multiple loops through the same data with many index checks, although in-place replacement is possible. [4] comes last because of the overhead of the state machine and of calling the replacement method, plus it does string compares rather than char compares.

lexx9999
  • The switch statement can be simplified to `if ( a[i] <= 'C' && a[i] >= 'A' ) a[i] = "_-+"[a[i] - 'A'];` because the switch statement will most likely be compiled to if else blocks for such small number of cases. – Slai Jun 26 '16 at 14:14
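
The one-liner in this comment relies on 'A', 'B' and 'C' being consecutive code points, so that c - 'A' can index into the replacement string. Dropped into the first implementation above, it might look like the following sketch. The class and constant names are made up for illustration, and whether it is faster than the switch would have to be measured.

```csharp
using System;

public static class LookupReplace
{
    // Replacement characters for 'A', 'B', 'C', indexed by (c - 'A').
    private const string Replacements = "_-+";

    public static string Foo(string text)
    {
        char[] a = text.ToCharArray();
        for (int i = 0; i < a.Length; i++)
        {
            char c = a[i];
            if (c >= 'A' && c <= 'C')
            {
                a[i] = Replacements[c - 'A'];
            }
        }
        return new string(a);
    }

    public static void Main()
    {
        Console.WriteLine(Foo("ABCD")); // prints "_-+D"
    }
}
```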

It seems to me that the regex is simply far more complicated.

String.Replace calls directly into native code inside the runtime and probably does pointer-based string manipulation, preventing fragmentation etc. (difficult to be sure, but it doesn't do this in managed code). If I fire up ILSpy I see that Regex.Replace does a lot of bounds checking, then does a Match, then uses a StringBuilder to collect the results of the calls to your delegate.

PhillipH

If we check the implementation for the methods you specify, we'll end up with nothing surprising.

Regex.Replace includes pattern matching and a lot of string concatenation, which results in overhead, while plain old String.Replace uses a C++ implementation directly (the comstring.cpp file), which is low-level and most probably very well optimized.

Zein Makki

Yeah, I was surprised by what you found as well. Not that Regex is the slowest, that was expected, but that chained String.Replace performed as well as StringBuilder. I did some checking of my own: I compared the same [1] as you did, but I modified [2] to get it as close to a bare-bones O(n) implementation as I could.

[1]

    static string Foo(string input)
    {
        string result = input.Replace("A", "_");
        result = result.Replace("B", "-");
        result = result.Replace("C", "+");
        return result;
    }

[2]

    static string Foo2(string input)
    {
        var length = input.Length;
        var sb = new char[length];
        for (int i = 0; i < length; i++)
        {
            switch (input[i])
            {
                case 'A':
                    sb[i] = '_';
                    break;
                case 'B':
                    sb[i] = '-';
                    break;
                case 'C':
                    sb[i] = '+';
                    break;
                default:
                    sb[i] = input[i];
                    break;
            }
        }
        // char[].ToString() would return the type name "System.Char[]", so build the string explicitly.
        return new string(sb);
    }

My test string is just over 7 million characters long (7230872, Lorem Ipsum). So we can notice several points:

  1. Execution of these two methods is extremely fast (approx. 55 ms vs 51 ms), so the activity of other processes and CPU availability play a significant role in the end result. I executed each method 100 times and summed the execution times, getting approx. 5500 ms for Foo vs 5100 ms for Foo2.
  2. Foo goes through the whole string at least 3 times (maybe more, not sure). Foo2, on the other hand, goes through it only once in its loop. Reading input.Length does not add another pass, because the length is stored with the string rather than counted. Thanks PhillipH
  3. As you replace more and more characters, the difference between Foo and Foo2 becomes more apparent. With 15 char replacements, Foo executes in approx. 240 ms and Foo2 in about 60 ms.
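
Point 3 follows from the structure of the two methods: every extra chained Replace adds another pass over the string, while a single-pass loop only gains another case. One way to keep the per-character work constant regardless of the number of replacements is a lookup table, sketched below. This is not the code that was timed above; the class and method names are made up for illustration.

```csharp
using System;

public static class TableReplace
{
    // Builds a table that maps every char to itself, then applies the overrides
    // so that from[i] maps to to[i].
    public static char[] BuildTable(string from, string to)
    {
        if (from.Length != to.Length)
        {
            throw new ArgumentException("from and to must have the same length");
        }

        var table = new char[char.MaxValue + 1];
        for (int i = 0; i < table.Length; i++)
        {
            table[i] = (char)i;
        }
        for (int i = 0; i < from.Length; i++)
        {
            table[from[i]] = to[i];
        }
        return table;
    }

    // Single pass over the input: one table lookup per character.
    public static string Replace(string input, char[] table)
    {
        var result = new char[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            result[i] = table[input[i]];
        }
        return new string(result);
    }

    public static void Main()
    {
        char[] table = BuildTable("ABC", "_-+"); // build once, reuse for every call
        Console.WriteLine(Replace("ABCD", table)); // prints "_-+D"
    }
}
```

The table costs 128 KB and should be built once and reused; adding a 4th or a 15th replacement then only changes the table contents, not the loop.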

So... In conclusion... No magic here, it just executes extremely fast... :)

Igor
  • String.Length does not iterate over the string. Other languages use a null terminator which does require iteration, but nulls embedded in a .NET string are valid characters, so the length is not found via iteration to a terminator. See http://stackoverflow.com/questions/717801/is-string-length-in-c-sharp-net-instant-variable for a discussion of this. Because strings are immutable, once created, their length cannot be changed so it does not need to be dynamically calculated. – PhillipH Jun 26 '16 at 19:50
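
The point about embedded nulls is easy to check with a small sketch (the class and method names are made up for illustration): a string containing '\0' in the middle still reports its full length, which would not be possible if Length scanned for a terminator.

```csharp
using System;

public static class LengthDemo
{
    // Returns the length of a string that contains an embedded null character.
    public static int LengthWithEmbeddedNull()
    {
        string s = "ab\0cd"; // '\0' is an ordinary character in a .NET string
        return s.Length;
    }

    public static void Main()
    {
        Console.WriteLine(LengthWithEmbeddedNull()); // prints 5
        Console.WriteLine("ab\0cd".IndexOf('\0'));   // prints 2
    }
}
```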