
The following snippet converts XML data to CSV data in a data processing application; `element` is an `XElement`. I'm currently trying to optimize the application's performance and was wondering if I could somehow combine the two operations going on below. Ultimately I still need access to the string value of `joined` and to the elements of `values` in a list, because they are used later on for other purposes. Not sure if this is doable. Any help will be appreciated!

The first operation strips the XML data of all tags and returns just the text between or outside them, collapsing each run of whitespace to a single space along the way. The second operation removes the line breaks and spaces that occur within the first few whitespace-separated chunks of the data, then rejoins the rest with single spaces.

IEnumerable<string> values = new List<string>();
values = element.DescendantNodes().OfType<XText>()
    .Select(v => Regex.Replace(v.Value, "\\s+", " ")).ToList();

string joined = string.Concat(element.ToString().Split().Take(3))
    + string.Join(" ", element.ToString().Split().Skip(3));
sparta93

1 Answer

IEnumerable<string> values = new List<string>();
values = …

Probably not going to be a big deal, but why create a new `List<string>()` just to throw it away? Replace this with either:

IEnumerable<string> values;
values = …

if you need `values` declared before it is assigned in a later scope, or else just:

IEnumerable<string> values = …

Then later on:

….Select(v => Regex.Replace(v.Value, "\\s+", " ")).ToList();

Do you really need it to be a list? Compare the speed with just:

….Select(v => Regex.Replace(v.Value, "\\s+", " "));

There are times when that's slower, and there are times when it just won't work, but there are a lot of times where ToList() is just a waste of time and memory.
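As a hedged illustration of that trade-off (the `Collapse` helper and counter are mine, not from the question): dropping `ToList()` makes the query lazy, which is a win when it is enumerated once, but the `Regex` then re-runs on every enumeration, so materialise the sequence if `values` is read more than once:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class LazyVsList
{
    static int regexCalls; // counts how often the projection actually runs

    static string Collapse(string s)
    {
        regexCalls++;
        return Regex.Replace(s, "\\s+", " ");
    }

    static void Main()
    {
        var raw = new[] { "a  b", "c\t d" };

        // Lazy: the Select runs each time the sequence is enumerated.
        IEnumerable<string> lazy = raw.Select(Collapse);
        string.Join("|", lazy); // first enumeration
        string.Join("|", lazy); // second enumeration re-runs the Regex
        Console.WriteLine(regexCalls); // 4

        // Materialised: the Select runs exactly once per item.
        regexCalls = 0;
        List<string> eager = raw.Select(Collapse).ToList();
        string.Join("|", eager);
        string.Join("|", eager);
        Console.WriteLine(regexCalls); // 2
    }
}
```

If `values` really is consumed repeatedly later on, as the question suggests, keeping the `ToList()` is the right call; the saving only applies to single-pass consumers.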

string joined = string.Concat(
  element.ToString().Split().Take(3))
  + string.Join(" ", element.ToString().Split().Skip(3));

The first thing is: why are you calling ToString() and Split() twice?

var splitOnWhiteSpace = element.ToString().Split();
string joined = string.Concat(
  splitOnWhiteSpace.Take(3))
  + string.Join(" ", splitOnWhiteSpace.Skip(3));

We can probably optimise the Join too with a custom approach:

var elString = element.ToString();
var buffer = new StringBuilder(elString.Length); // Can't be larger, unlikely to be much smaller, so obtain necessary space in advance.
using (var en = ((IEnumerable<string>)elString.Split()).GetEnumerator())
{
    int count = 0;
    while (en.MoveNext() && ++count != 4)
        buffer.Append(en.Current);
    if (count == 4)
    {
        buffer.Append(en.Current); // first chunk of the space-joined tail
        while (en.MoveNext())
            buffer.Append(' ').Append(en.Current);
    }
}
string joined = buffer.ToString();

If this was being hit by several loops I'd consider holding onto the buffer between cycles (Clear() it after each use rather than creating a new one).
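A minimal sketch of that reuse (the wrapper class and names are mine): hold one StringBuilder for the lifetime of the loop and Clear() it per cycle, which keeps the allocated capacity instead of allocating a fresh backing array every time:

```csharp
using System.Text;

class JoinBuffer
{
    // One buffer held across cycles; Clear() resets the length but
    // keeps the capacity, so later cycles reuse the same backing array.
    private readonly StringBuilder _buffer = new StringBuilder(1024);

    public string Join(string[] parts)
    {
        _buffer.Clear();
        for (int i = 0; i < parts.Length; i++)
        {
            if (i > 3)
                _buffer.Append(' '); // separator only within the space-joined tail
            _buffer.Append(parts[i]);
        }
        return _buffer.ToString();
    }
}
```

Note the separator condition `i > 3`: like `Concat(Take(3)) + Join(" ", Skip(3))`, there is no space between the third and fourth chunk, only between chunks four onwards.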

If the string being split was very large I might consider a custom version of Split() that iterated through the string only emitting which chunks it needed rather than creating an array in each pass, but I wouldn't worry about that until I'd tried the above more obvious improvements first.
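That custom Split might look roughly like this (a sketch, assuming `char.IsWhiteSpace` matches the default `Split()` delimiters): it walks the string once and yields each chunk on demand, so no intermediate array is ever built:

```csharp
using System;
using System.Collections.Generic;

static class StringSplitExtensions
{
    // Lazily yields the same chunks as the parameterless string.Split():
    // split on whitespace, with empty entries for consecutive separators,
    // but without allocating the whole array up front.
    public static IEnumerable<string> SplitLazy(this string s)
    {
        int start = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsWhiteSpace(s[i]))
            {
                yield return s.Substring(start, i - start);
                start = i + 1;
            }
        }
        yield return s.Substring(start);
    }
}
```

Plugging `SplitLazy()` into the enumerator version above would keep the working set flat regardless of input size, at the cost of a Substring allocation per chunk.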

Jon Hanna
  • Processing time for a GB of data went from ~7 minutes to ~6.4 minutes – sparta93 Jun 22 '15 at 18:42
  • If you replace your `XElement` approach to one that uses `XmlReader`, and does so in a streaming manner (`yield return`ing results as it gets them) you're likely to be able to do much better; `XmlReader` is fiddlier to use, but generally faster. – Jon Hanna Jun 22 '15 at 18:48
  • This is the approach I'm currently using, it does use a `XmlReader`; check answer in post http://stackoverflow.com/questions/30916104/converting-very-large-files-from-xml-to-csv – sparta93 Jun 22 '15 at 19:00