
(Problem solved. See my answer below.)

I just profiled my project (WinForms / C#) because it felt much slower than before. Strangely, List.AddRange() accounts for 92% of the total CPU time in the profile.

Code1: With the following code, a scan job takes 2m30s to finish (not in profiling mode):

        var allMatches = new List<Match>();
        foreach (var typedRegex in Regexes)
        {
            var ms = typedRegex.Matches(text); //typedRegex is just Regex.
            allMatches.AddRange(ms);
        }

Profiler output:

    Function Name:       [External Call] System.Collections.Generic.List.InsertRange(int, System.Collections.Generic.IEnumerable<!0>)
    Total CPU [unit, %]: 146579 (92.45%)
    Self CPU [unit, %]:  146579 (92.45%)
    Module:              Multiple modules
    Category:            IO | Kernel

Code2: So I removed the AddRange call, and it takes only 1.6s:

        var allMatches = new List<Match>();
        foreach (var typedRegex in Regexes)
        {
            var ms = typedRegex.Matches(text);
            // allMatches.AddRange(ms);
        }

Code3: Thinking that there might be some kind of "lazy load" mechanism, I added a counter to force Regex.Matches() to do its work. The value of the counter is displayed in the UI. Now it takes 9s:

        public static int Count = 0;
        var allMatches = new List<Match>();
        foreach (var typedRegex in Regexes)
        {
            var ms = typedRegex.Matches(text);
            // allMatches.AddRange(ms);
            Count += ms.Count;
        }

Code4: Noticing that the value of Count is 32676, I pre-allocated memory for the list. It still takes 9s:

        public static int Count = 0;
        var allMatches = new List<Match>(33000);
        foreach (var typedRegex in Regexes)
        {
            var ms = typedRegex.Matches(text);
            // allMatches.AddRange(ms);
            Count += ms.Count;
        }

Code5: Thinking that List.AddRange(MatchCollection) might be unusual, I changed the code to foreach (...) { List.Add(match); }, but nothing changed: still 2m30s. The profile now says:

    Function Name:       [External Call] System.Text.RegularExpressions.MatchCollection.MatchCollection+Enumerator.MoveNext()
    Total CPU [unit, %]: 183804 (92.14%)
    Self CPU [unit, %]:  183804 (92.14%)
    Module:              Multiple modules
    Category:            IO | Kernel

Code6: SelectMany takes 2m30s as well. It was my original solution.

    var allMatches = Regexes.SelectMany(i => i.Matches(text)); 

So, maybe creating a list of 32676 items is expensive, but costing more than ten times as much as creating those Match objects is beyond what I can imagine. Just one day earlier, the job took 27s to finish. I made a lot of changes today and thought the profiler would tell me why it slowed down, but it didn't. That AddRange() call had been there for a month, and I can barely remember seeing its name in any earlier profile.

I will try to remember what happened during the day. But could anybody explain the profile result above? Thanks for any help.

cheny

1 Answer


In the end, the problem was not AddRange() but Regex.Matches(). The time dropped from 2m30s to less than 11s after I optimized the regex.

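First of all, Regex.Matches() DOES use lazy evaluation. That's why it returns a MatchCollection rather than a normal list. MatchCollection creates an item only when you use that item; the GetMatch code below runs the regex on the calling thread, on demand, so the cost shows up wherever the collection is first enumerated (here, inside AddRange()).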

MatchCollection.Count costs less than ToArray(), just as IEnumerable.Count() costs less than IEnumerable.ToArray() (less garbage to collect?).
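To see where the time really goes, matching and list-building can be timed separately. The following is a minimal sketch, not the code from my project: the class name MatchTiming and the method CollectMatches are hypothetical names made up for illustration. It relies on reading MatchCollection.Count to make the collection find all of its matches before AddRange() runs, so the second stopwatch should only see the copy.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (illustration only): separates "regex work" from "list work".
public static class MatchTiming
{
    public static List<Match> CollectMatches(
        IEnumerable<Regex> regexes, string text,
        out TimeSpan matchTime, out TimeSpan copyTime)
    {
        var matchWatch = new Stopwatch();
        var copyWatch = new Stopwatch();
        var allMatches = new List<Match>();

        foreach (var regex in regexes)
        {
            matchWatch.Start();
            MatchCollection ms = regex.Matches(text); // cheap: nothing is matched yet
            _ = ms.Count;                             // forces every match to be found now
            matchWatch.Stop();

            copyWatch.Start();
            // The matches are already cached in the collection, so this is only a copy.
            // Cast<Match>() keeps the sketch compiling on runtimes where
            // MatchCollection is not an IEnumerable<Match>.
            allMatches.AddRange(ms.Cast<Match>());
            copyWatch.Stop();
        }

        matchTime = matchWatch.Elapsed;
        copyTime = copyWatch.Elapsed;
        return allMatches;
    }
}
```

Called as `MatchTiming.CollectMatches(Regexes, text, out var matchTime, out var copyTime)`, I would expect matchTime to dominate whenever the regexes themselves are slow, which is what the profiler was reporting under the name of AddRange().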

Here is code from MatchCollection:

    private Match GetMatch(int i)
    {
      if (this._matches.Count > i)
        return this._matches[i];
      if (this._done)
        return (Match) null;
      Match match;
      do
      {
        match = this._regex.Run(false, this._prevlen, this._input, 0, this._input.Length, this._startat);
        if (!match.Success)
        {
          this._done = true;
          return (Match) null;
        }
        this._matches.Add(match);
        this._prevlen = match.Length;
        this._startat = match._textpos;
      }
      while (this._matches.Count <= i);
      return match;
    }

And it's so lazy that if you ask for the 2nd item, it never computes the 3rd.
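This also means you can stop early and pay only for the matches you actually read. Here is a small sketch of that idea; LazyMatchDemo and FirstMatches are hypothetical names used only for this example. It uses a plain foreach with a break, on the assumption (based on the GetMatch code above) that the enumerator asks for one match at a time.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (illustration only): reads at most `limit` matches.
public static class LazyMatchDemo
{
    public static List<string> FirstMatches(Regex regex, string text, int limit)
    {
        var values = new List<string>();
        if (limit <= 0)
            return values;

        // Matches() returns immediately; each loop iteration asks for one more match.
        foreach (Match m in regex.Matches(text))
        {
            values.Add(m.Value);
            if (values.Count == limit)
                break; // stop here: later matches are never requested
        }
        return values;
    }
}
```

For example, `LazyMatchDemo.FirstMatches(new Regex(@"\d+"), "1 22 333 4444", 2)` returns the two values "1" and "22". I used a plain loop rather than LINQ's Take() because some LINQ implementations may read Count on the collection first, and reading Count forces all matches to be computed.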

cheny