
I have a use case where I need to:

  • iterate through each Input node in an XML document,
  • perform a time-intensive calculation on each Input, and
  • write the results to an XML file.

Input looks something like this:

<Root>
  <Input>
    <Case>ABC123</Case>
    <State>MA</State>
    <Investor>Goldman</Investor>
  </Input>
  <Input>
    <Case>BCD234</Case>
    <State>CA</State>
    <Investor>Goldman</Investor>
  </Input>
</Root>

and the output:

<Results>
  <Output>
    <Case>ABC123</Case>
    <State>MA</State>
    <Investor>Goldman</Investor>
    <Price>75.00</Price>
    <Product>Blah</Product>
  </Output>
  <Output>
    <Case>BCD234</Case>
    <State>CA</State>
    <Investor>Goldman</Investor>
    <Price>55.00</Price>
    <Product>Ack</Product>
  </Output>
</Results>

I would like to run the calculations in parallel; the typical input file may have 50,000 input nodes, and the total processing time without threading may be 90 minutes. Approximately 90% of the processing time is spent on step #2 (the calculations).

I can iterate over the XmlReader in parallel easily enough:

static IEnumerable<XElement> EnumerateAxis(XmlReader reader, string axis)
{
  reader.MoveToContent();
  while (!reader.EOF)
  {
    if (reader.NodeType == XmlNodeType.Element && reader.Name == axis)
    {
      // XElement.ReadFrom consumes the element and leaves the reader
      // positioned after it, so don't call Read() again in this branch
      // or a sibling element may be skipped.
      XElement el = XElement.ReadFrom(reader) as XElement;
      if (el != null)
        yield return el;
    }
    else
    {
      reader.Read();
    }
  }
}
...
Parallel.ForEach(EnumerateAxis(reader, "Input"), node =>
{ 
  // do calc
  // lock the XmlWriter, write, unlock
});

I'm currently inclined to use a lock when writing to the XmlWriter to ensure thread safety.
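For reference, the lock-based version I have in mind looks roughly like this (a sketch; `Calculate` and `WriteResult` stand in for the real calculation and serialization code, and the output file name is made up):

```csharp
var writeLock = new object();

using (var writer = XmlWriter.Create("results.xml"))
{
    writer.WriteStartElement("Results");

    Parallel.ForEach(EnumerateAxis(reader, "Input"), node =>
    {
        var result = Calculate(node); // ~90% of the work, runs in parallel

        lock (writeLock)              // serialize access to the XmlWriter
        {
            WriteResult(writer, result);
        }
    });

    writer.WriteEndElement();
}
```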

Is there a more elegant way to handle the XmlWriter in this case? Specifically, should I have the Parallel.ForEach code pass the results back to the originating thread and have that thread handle the XmlWriter, avoiding the need to lock? If so, I'm unsure of the correct approach for this.

Eric Patrick
  • Do you need to consume your file immediately, or is it used by another process? – BRAHIM Kamel Jun 28 '14 at 13:41
  • You can use the ConcurrentDictionary class (http://msdn.microsoft.com/en-us/library/dd287191%28v=vs.110%29.aspx); if Case is a unique identifier, then you can use it as a key (read the entire XML doc, run the calculation on the nodes using the TPL, adding results to the ConcurrentDictionary, and finally write back the output XML doc based on that ConcurrentDictionary). Rgds, – Alexander Bell Jun 28 '14 at 14:01
  • 1
    Provided that ordering of the results is insignificant you have it pretty much spot-on, and that includes locking the `XmlWriter`. I do, however, like your way of thinking with regard to delegating your `XmlWriter` duties to a dedicated thread in a pipeline-like fashion. If you want to take that approach all the way (and this may or may not result in a performance boost - it's wildly case-specific), you can even look at turning this into a 3-stage async pipeline (read, process, write) with parallelisation in stage 2. – Kirill Shlenskiy Jun 28 '14 at 14:23
  • I do not need to consume the output of the XmlWriter immediately. Re: the ConcurrentDictionary, I worry about memory consumption given the 50K input nodes. My sample is small, but in the real use cases, the input nodes may be much bigger. However, I'll check it out! – Eric Patrick Jun 28 '14 at 15:05
  • Kirill, I would indeed like to handle the write stage in a single thread. I suppose I'm seeking the opposite of 'yield'. As I think through how I'd do this, my gut tells me there's probably a nice solid pattern for this and I'm reinventing a wheel. – Eric Patrick Jun 28 '14 at 15:12
  • Alex, thanks for the feedback. Case is not necessarily unique, but I could use your approach with SynchronizedCollection: http://msdn.microsoft.com/en-us/library/ms668265.aspx. I'll compare that performance to the lock / write / unlock approach. – Eric Patrick Jun 28 '14 at 15:15
  • I suggest you use XElement.ReadFrom inside the parallel loop and not in the EnumerateAxis method, because it is more CPU-intensive. Use ReadOuterXml in the enumerator. – Artur Alexeev May 25 '22 at 13:45
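The change Artur describes would look roughly like this (a sketch): the enumerator hands out raw XML strings, which is cheap to do on the reader thread, and the `XElement` parsing moves into the parallel loop where it can run on multiple cores. Note that `ReadOuterXml` advances the reader past the element, so the loop only calls `Read()` when nothing was consumed:

```csharp
static IEnumerable<string> EnumerateAxisRaw(XmlReader reader, string axis)
{
  reader.MoveToContent();
  while (!reader.EOF)
  {
    if (reader.NodeType == XmlNodeType.Element && reader.Name == axis)
      yield return reader.ReadOuterXml(); // advances past the element
    else
      reader.Read();
  }
}
...
Parallel.ForEach(EnumerateAxisRaw(reader, "Input"), xml =>
{
  XElement node = XElement.Parse(xml); // parsing now runs in parallel
  // do calc
});
```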

2 Answers


This is my favourite kind of problem: one which can be solved with a pipeline.

Please note that, depending on your circumstances, this approach may actually hurt performance; but since you've explicitly asked how you can use the writer on a dedicated thread, the code below demonstrates exactly that.

Disclaimer: you should ideally consider TPL Dataflow for this, but it's not something I'm well-versed in so I'll just take the familiar Task + BlockingCollection<T> route.

At first I was going to suggest a 3-stage pipeline (read, process, write), but then I realised that you've already combined the first two stages with the way you "stream" the nodes as they're being read and feeding them to your Parallel.ForEach (yes, you've already implemented a pipeline of sorts). Even better - less thread synchronisation.

With that in mind, the code now becomes:

public class Result
{
    public string Case { get; set; }
    public string State { get; set; }
    public string Investor { get; set; }
    public decimal Price { get; set; }
    public string Product { get; set; }
}

...

using (var reader = CreateXmlReader())
{
    // I highly doubt that this collection will
    // ever reach its bounded capacity since
    // the processing stage takes so long,
    // but in case it does, Parallel.ForEach
    // will be throttled.
    using (var handover = new BlockingCollection<Result>(boundedCapacity: 100))
    {
        var processStage = Task.Run(() =>
        {
            try
            {
                Parallel.ForEach(EnumerateAxis(reader, "Input"), node =>
                {
                    // Do calc.
                    Thread.Sleep(1000);

                    // Hand over to the writer.
                    // This handover is not blocking (unless our 
                    // blocking collection has reached its bounded
                    // capacity, which would indicate that the
                    // writer is running slower than expected).
                    handover.Add(new Result());
                });
            }
            finally
            {
                handover.CompleteAdding();
            }
        });

        var writeStage = Task.Run(() =>
        {
            using (var writer = CreateXmlWriter())
            {
                foreach (var result in handover.GetConsumingEnumerable())
                {
                    // Write element.
                }
            }
        });

        // Note: the two stages are now running in parallel.
        // You could technically use Parallel.Invoke to
        // achieve the same result with a bit less code.
        Task.WaitAll(processStage, writeStage);
    }
}
Kirill Shlenskiy
  • 4
    +1. Pipelining is highly under-appreciated. I see a lot of overly complex threading scenarios that are much more elegantly solved with a simple pipeline. Good answer. – Jim Mischel Jun 28 '14 at 15:51
Define a struct to hold the per-node results and a ConcurrentDictionary keyed by Case:
struct nodeParams
{
    internal string State;
    internal string Investor;
    internal double Price;
    internal string Product;
}

internal ConcurrentDictionary<string, nodeParams> cd = new ConcurrentDictionary<string, nodeParams>();

Then modify your code:

static IEnumerable<XElement> EnumerateAxis(XmlReader reader, string axis)
{
  reader.MoveToContent();
  while (!reader.EOF)
  {
    if (reader.NodeType == XmlNodeType.Element && reader.Name == axis)
    {
      // XElement.ReadFrom consumes the element and leaves the reader
      // positioned after it, so don't call Read() again in this branch
      // or a sibling element may be skipped.
      XElement el = XElement.ReadFrom(reader) as XElement;
      if (el != null)
        yield return el;
    }
    else
    {
      reader.Read();
    }
  }
}
...
Parallel.ForEach(EnumerateAxis(reader, "Input"), node =>
{
  nodeParams np = new nodeParams();

  // do calc, put the result in np, and add it to cd using Case as the key

});

// Update XML doc based on the content of cd
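That final update step might be sketched as follows (the output file name and the Price formatting are assumptions):

```csharp
using (var writer = XmlWriter.Create("results.xml"))
{
    writer.WriteStartElement("Results");

    // cd maps Case -> nodeParams, filled in by the parallel loop above.
    foreach (var pair in cd)
    {
        writer.WriteStartElement("Output");
        writer.WriteElementString("Case", pair.Key);
        writer.WriteElementString("State", pair.Value.State);
        writer.WriteElementString("Investor", pair.Value.Investor);
        writer.WriteElementString("Price", pair.Value.Price.ToString("F2"));
        writer.WriteElementString("Product", pair.Value.Product);
        writer.WriteEndElement();
    }

    writer.WriteEndElement();
}
```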
Alexander Bell