I have many hundreds of XML files that are processed one after another. I want to apply other, existing code to the data in the files. However, that code is written to expect a single file or a stream representing a single file.

I had a look at "How do I concatenate two System.Io.Stream instances into one?"

However, the StreamEnumerator presented in Marc's answer requires streams to be opened to all the files in question at once. That doesn't seem like a good approach, given the large number of files in my case.

The existing code consumes the stream like this:

XmlReader reader = XmlReader.Create(xmlStream);

Is there a better way to combine the many files into a single stream?

Eric J.
  • The method seems OK. I think you will have to add something like an opening "<global_doc>" tag at the beginning of the stream, remove the XML declaration when adding each file, and add "</global_doc>" at the end of the stream. – Graffito Jan 07 '17 at 00:05
  • I'm trying to avoid having hundreds or perhaps a few thousand streams opened at once. – Eric J. Jan 07 '17 at 00:06
  • The method consists of creating a single xmlStream to be consumed by the existing code. Run a separate thread that writes data to the stream by reading the files to be merged one after the other. After launching the separate thread, execute "XmlReader reader = XmlReader.Create(xmlStream);". – Graffito Jan 07 '17 at 00:12
  • @HenkHolterman: I'm trying to find a general solution first, then will tweak it to seamlessly combine the XML data. – Eric J. Jan 10 '17 at 23:34

1 Answer

Well, I would write my own class that extends System.IO.Stream and, by overriding the CanRead and Read methods, joins those streams on demand. Something like this (just a stub of the concept; you need to fine-tune this code):

using System;
using System.Diagnostics;
using System.IO;
using System.Xml;

namespace ConsoleApplication1
{

    public class CombinedXmlStream : Stream
    {
        private Stream currentStream, startStream, endStream;
        private String[] files;
        private int currentFile = -2;
        private bool endReached = false;

        private static Stream ToStream(String str)
        {
            MemoryStream stream = new MemoryStream();
            StreamWriter writer = new StreamWriter(stream);
            writer.Write(str);
            writer.Flush();
            stream.Position = 0;
            return stream;
        }

        public CombinedXmlStream(String start, String end, params String[] files)
        {
            this.files = files;
            startStream = ToStream(start);
            endStream = ToStream(end);

        }

        public override bool CanRead { get { return true; } }

        public override bool CanSeek { get { return false; } }

        public override bool CanWrite { get { return false; } }

        public override long Length { get { throw new NotImplementedException(); } }

        public override long Position { get { return 0; } set { } }

        public override void Flush() { throw new NotImplementedException(); }

        public override long Seek(long offset, SeekOrigin origin) { throw new NotImplementedException(); }

        public override void SetLength(long value) { throw new NotImplementedException(); }

        public override void Write(byte[] buffer, int offset, int count) { throw new NotImplementedException(); }

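        public override int Read(byte[] buffer, int offset, int count)
        {
            // Loop rather than recurse, so a long run of empty files cannot overflow the stack.
            while (true)
            {
                doSwitching();

                // Every part has been consumed: report end of stream instead of dereferencing null.
                if (currentStream == null)
                {
                    return 0;
                }

                int output = currentStream.Read(buffer, offset, count);

                // A zero-byte request also returns 0; that does not mean the part is exhausted.
                if (output > 0 || count == 0)
                {
                    return output;
                }

                // The current part is exhausted: force a switch to the next one.
                doSwitching(true);
            }
        }

        private void doSwitching(bool force = false)
        {
            if (force || currentStream == null || !currentStream.CanRead)
            {
                if (currentStream != null)
                {
                    currentStream.Close();
                    currentStream = null;
                }

                // Nothing follows the end marker, so stop advancing.
                if (endReached)
                {
                    return;
                }

                currentFile++;
                if (currentFile == -1)
                {
                    // First part: the opening wrapper text.
                    currentStream = startStream;
                }
                else if (currentFile >= files.Length)
                {
                    // Last part: the closing wrapper text.
                    currentStream = endStream;
                    endReached = true;
                }
                else
                {
                    // Open read-only; only one file is open at any time.
                    currentStream = new FileStream(files[currentFile], FileMode.Open, FileAccess.Read);
                }
            }
        }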
    }

    class Program
    {
        static void Main(string[] args)
        {
            Debug.WriteLine("Test me");
            using (XmlReader reader = XmlReader.Create(new CombinedXmlStream("<combined>", "</combined>", @"D:\test.xml", @"D:\test2.xml")))
            {
                //reader.MoveToContent();
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        Debug.WriteLine("Node: " + reader.Name);
                    }
                }
            }
        }
    }
}
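
Since the question asks for a general solution first, the same idea can be sketched without any XML-specific logic. The following is only a sketch under stated assumptions, not a tested drop-in replacement: `LazyConcatStream`, `WrapFiles` and `Parts` are hypothetical names introduced here for illustration. Each part is supplied as a factory delegate, so a source is opened only when the previous one has been read to the end, and at most one file is open at a time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not from the answer above): concatenates streams lazily.
public sealed class LazyConcatStream : Stream
{
    private readonly IEnumerator<Func<Stream>> sources;
    private Stream current;

    public LazyConcatStream(IEnumerable<Func<Stream>> sourceFactories)
    {
        sources = sourceFactories.GetEnumerator();
    }

    // Convenience factory mirroring the wrapper-element idea: start text, files, end text.
    public static LazyConcatStream WrapFiles(string start, string end, IEnumerable<string> paths)
    {
        return new LazyConcatStream(Parts(start, end, paths));
    }

    private static IEnumerable<Func<Stream>> Parts(string start, string end, IEnumerable<string> paths)
    {
        yield return () => new MemoryStream(Encoding.UTF8.GetBytes(start));
        foreach (string path in paths)
        {
            string p = path; // copy for the closure
            yield return () => File.OpenRead(p);
        }
        yield return () => new MemoryStream(Encoding.UTF8.GetBytes(end));
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (count == 0) return 0; // a zero-byte request must not skip sources
        while (true)
        {
            if (current == null)
            {
                if (!sources.MoveNext()) return 0; // no sources left: end of stream
                current = sources.Current();       // open the next source only now
            }
            int n = current.Read(buffer, offset, count);
            if (n > 0) return n;
            current.Dispose(); // exhausted: close it before opening the next one
            current = null;
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            if (current != null) current.Dispose();
            sources.Dispose();
        }
        base.Dispose(disposing);
    }
}
```

The result of `LazyConcatStream.WrapFiles("<combined>", "</combined>", paths)` could be passed to `XmlReader.Create` in the same way as `xmlStream` in the question. A single `Read` call may return fewer bytes than requested at a file boundary, which the `Stream` contract allows; the next call continues with the following file. As with the class above, this sketch assumes the individual files carry no XML declaration.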
Vir
  • Seems this would only work if Read() requests don't cross the boundary between two physical files. It also seems you assume one call to CanRead for each stream, which seems unlikely. – Eric J. Jan 09 '17 at 18:21
  • Did you even check? CanRead is called multiple times after each read (test it yourself by extending FileStream, overriding those two methods, and using breakpoints and/or debug console output). Next, the boundary isn't a problem in general, but yes, in this example you're right. Then again, you can ignore CanRead and do all the logic inside the Read method. The principle stays the same: create your own stream implementation that merges the files at the right time. – Vir Jan 09 '17 at 19:34
  • @EricJ. I re-edited my example into a full working proof of concept. Create two XML files (they can't have a preamble; this is just a POC, so you'll need to add that handling yourself) and this will combine those streams. Don't worry about the boundary between files: Read is called until it reaches EOF, and then the switching takes place. Also note that this code is just a POC written in 10 minutes or so :) – Vir Jan 09 '17 at 20:14
  • Thanks for implementing the logic to switch streams during a read. I'll try it out. – Eric J. Jan 09 '17 at 21:22