I have a third-party system that streams XML to me over TCP. The total transmitted XML content (not a single message of the stream, but all messages concatenated) looks like this:

   <root>
      <insert ....><remark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</remark></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

Every line of the above sample can be processed individually. Since this is a streaming process, I cannot wait until everything has arrived; I have to process the content as it comes. The problem is that the chunks can be cut at any point, and tag boundaries are not respected. Do you have any advice on how to process the content if it arrives in fragments like this?

Chunk 1:

  <root>
      <insert ....><rem

Chunk 2:

                      ark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</rema

Chunk N:

                                    rk></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

EDIT:

While processing speed is not a concern (there are no real-time constraints), I cannot wait for the entire message. In practice, the last chunk never arrives. The third-party system sends messages whenever it encounters changes, so the stream never ends.

user256890
  • Do you need to process it in real-time or can you wait until you get the complete content? In other words, is this problem about processing XML fragments or joining data from a stream in the most elegant/efficient way? – daveaglick Jun 23 '11 at 13:44
  • I think if you can use blocking reads you can use the `XmlReader` class. No idea how to do it with non-blocking IO. – CodesInChaos Jun 23 '11 at 13:45
  • Sounds to me like you are going to need some string manipulation: when you receive a chunk, you take out the processable portions, process them (async would be good if you can), then append the next chunk to whatever is left, and loop round like that. – Duncan Howe Jun 23 '11 at 13:55
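
The buffer-and-extract loop described in the last comment could look roughly like the sketch below. It is only an illustration under assumptions that the question does not confirm: the class name `FragmentBuffer` is hypothetical, the only top-level elements are `insert` and `delete`, they are never self-closing or nested inside an element of the same name, and the literal closing tag never appears inside the element's text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: accumulates text chunks and hands back complete elements.
public class FragmentBuffer
{
    // Assumption: these are the only element names that need processing.
    private static readonly string[] ElementNames = { "insert", "delete" };
    private readonly StringBuilder _pending = new StringBuilder();

    // Appends a chunk and returns every element that is now complete.
    public List<string> Append(string chunk)
    {
        _pending.Append(chunk);
        string text = _pending.ToString();
        var complete = new List<string>();
        int pos = 0;

        while (true)
        {
            // Find the earliest element start at or after pos.
            int bestStart = -1;
            string bestName = null;
            foreach (string name in ElementNames)
            {
                int s = text.IndexOf("<" + name, pos, StringComparison.Ordinal);
                if (s != -1 && (bestStart == -1 || s < bestStart))
                {
                    bestStart = s;
                    bestName = name;
                }
            }
            if (bestStart == -1) break;

            // The element is complete only once its closing tag has arrived.
            string close = "</" + bestName + ">";
            int e = text.IndexOf(close, bestStart, StringComparison.Ordinal);
            if (e == -1) break;

            int stop = e + close.Length;
            complete.Add(text.Substring(bestStart, stop - bestStart));
            pos = stop;
        }

        // Keep only the unprocessed tail for the next chunk.
        _pending.Remove(0, pos);
        return complete;
    }
}
```

Each returned string is a self-contained element, so it could then be handed to something like `XElement.Parse` for real XML parsing. The string search is only there to find element boundaries.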

2 Answers

My first thought for this problem is to create a simple TextReader derivative that is responsible for buffering input from the stream. This class would then be used to feed an XmlReader. The TextReader derivative could fairly easily scan the incoming content looking for complete "blocks" of XML (a complete element with starting and ending brackets, a text fragment, a full attribute, etc.). It could also provide a flag to the calling code to indicate when one or more "blocks" are available so it can ask for the next XML node from the XmlReader, which would trigger sending that block from the TextReader derivative and removing it from the buffer.

Edit: Here's a quick and dirty example. I have no idea if it works perfectly (I haven't tested it), but it gets across the idea I was trying to convey.

using System;
using System.Collections.Generic;
using System.IO;

public class StreamingXmlTextReader : TextReader
{
    private readonly Queue<string> _blocks = new Queue<string>();
    private string _buffer = String.Empty;
    private string _currentBlock = null;
    private int _currentPosition = 0;

    //Appends newly received text and splits off every complete block
    //Returns true if blocks are available and the XmlReader can go to the next XML node
    public bool AddFromStream(string content)
    {
        //Here is where we would scan for simple blocks of XML
        //This simple chunking algorithm just cuts after each closing angle bracket
        //Not sure if/how well this will work in practice, but you get the idea
        _buffer = _buffer + content;
        int start = 0;
        int end = _buffer.IndexOf('>');
        while (end != -1)
        {
            //Include the '>' itself so the block is not truncated
            _blocks.Enqueue(_buffer.Substring(start, end - start + 1));
            start = end + 1;
            end = _buffer.IndexOf('>', start);
        }

        //Store the leftover (an incomplete block) if there is any
        _buffer = _buffer.Substring(start);

        return BlocksAvailable;
    }

    //Lets the caller know if any blocks are currently available, signaling the XmlReader can ask for another node
    public bool BlocksAvailable { get { return _blocks.Count > 0; } }

    public override int Read()
    {
        //Move on to the next block once the current one is exhausted
        if (_currentBlock == null || _currentPosition >= _currentBlock.Length)
        {
            if (!BlocksAvailable)
            {
                //Nothing buffered; note that -1 signals end of input to the caller
                return -1;
            }
            _currentBlock = _blocks.Dequeue();
            _currentPosition = 0;
        }
        //Get the next character in this block
        return _currentBlock[_currentPosition++];
    }
}
daveaglick

After further investigation, we figured out that the XML stream was being sliced up by the TCP buffer whenever it got full. The cuts therefore fell at arbitrary positions in the byte stream, even in the middle of multi-byte Unicode characters. We had to assemble the parts at the byte level and convert the result back to text. If the conversion failed, we waited for the next byte chunk and tried again.
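
For readers who hit the same problem, here is a minimal sketch of a related technique. It is not the retry-on-failure approach described above, but the stateful `System.Text.Decoder`, which holds back the bytes of an incomplete character until the rest arrives. The class name `ChunkDecoder` is hypothetical, and UTF-8 is an assumption, since the actual encoding of the stream is not stated.

```csharp
using System;
using System.Text;

// Hypothetical helper: turns arbitrarily sliced byte chunks into text.
public class ChunkDecoder
{
    // Assumption: the stream is UTF-8. A Decoder (unlike Encoding.GetString)
    // remembers trailing bytes of an incomplete character between calls.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    // Decodes the first `count` bytes of `chunk` and returns only the
    // characters that are complete so far.
    public string Decode(byte[] chunk, int count)
    {
        char[] chars = new char[_decoder.GetCharCount(chunk, 0, count)];
        int written = _decoder.GetChars(chunk, 0, count, chars, 0);
        return new string(chars, 0, written);
    }
}
```

Each call returns only fully decoded characters, so the resulting strings can be appended to whatever text buffer is used to find complete XML elements.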

user256890