There is an XML document that is passed through a named pipe. The XML document is large, about 500 megabytes. The structure of the document is roughly like this:

<Root>
<SomeElement/>
<?pi?>
<NewMessage>
  <A>
   <B></B>
  </A>
</NewMessage>
<?pi?>
<NewMessage>
  <A>
    <B></B>
  </A>
</NewMessage> 
<?pi?>
<NewMessage>
  <A>
    <B></B>
  </A>
</NewMessage> 
</Root>

Each new message starts with a processing instruction. I want to be able to detect processing instructions before opening an XML reader, and open an XML reader for each message that starts with a processing instruction. I am doing this in order to check the tag balance within the message, and if the balance is not maintained, skip it. So, if there is a document like this:

<Root>
<SomeElement/>
<?pi?>
<NewMessage>
  <A>
   <B>
<?pi?>
<NewMessage>
  <A>
    <B></B>
  </A>
</NewMessage> 
<?pi?>
<NewMessage>
  <A>
    <B></B>
  </A>
</NewMessage> 
</Root>

Then the first message should be discarded, and all the remaining messages should be kept along with the rest of the XML document. The result would be:

<Root>
<SomeElement/>
<?pi?>
<NewMessage>
  <A>
    <B></B>
  </A>
</NewMessage> 
<?pi?>
<NewMessage>
  <A>
    <B></B>
  </A>
</NewMessage> 
</Root>

I want to use an XmlReader with a name table, but it is not clear how to preprocess the stream so that the processing instructions are identified before the reader is opened. Thank you in advance.

I tried something like the following, but it obviously breaks on a large XML document (only one 4096-byte chunk is read), and not every message gets processed by the XML reader:

    XmlNameTable nameTable = new NameTable();
    byte[] buffer = new byte[4096];
    int bytesRead = pipe.Read(buffer, 0, buffer.Length);
    string input = Encoding.UTF8.GetString(buffer, 0, bytesRead);
    Match piMatch = Regex.Match(input, "<\\?pi.*?\\?>");

    if (piMatch.Success)
    {
        string pi = piMatch.Value;
        string xml = input.Substring(piMatch.Index + piMatch.Length);
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), new XmlReaderSettings { NameTable = nameTable }))
        {
            while (reader.Read())
            {
                // ...
            }
        }
    }
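
For reference, here is a rough sketch of the direction described above: it is not a tested solution, only an illustration under several assumptions. It assumes every message starts with the literal text `<?pi`, that this text never appears inside comments or CDATA, that a single message fits in memory, and that the root element is closed by `</Root>`. The names `PiSplitter`, `Split`, `IsWellFormed`, and `Filter` are hypothetical. The stream is cut at each marker, each chunk is checked with an `XmlReader` in fragment mode sharing one `NameTable`, and chunks that fail to parse are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class PiSplitter
{
    // Assumption: every message begins with this literal text.
    private const string Marker = "<?pi";

    // Yields the text before the first marker, then one chunk per marker.
    // A marker split across two reads is still found, because the unfinished
    // tail is kept in 'pending' and searched again after the next read.
    public static IEnumerable<string> Split(TextReader input, int bufferSize = 4096)
    {
        var pending = new StringBuilder();
        var buffer = new char[bufferSize];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            pending.Append(buffer, 0, read);
            string text = pending.ToString();
            int start = 0;
            int next;
            // Search from start + 1 so the marker that opens the current chunk is skipped.
            while ((next = text.IndexOf(Marker, start + 1, StringComparison.Ordinal)) >= 0)
            {
                yield return text.Substring(start, next - start);
                start = next;
            }
            pending.Clear();
            pending.Append(text, start, text.Length - start);
        }
        if (pending.Length > 0)
        {
            yield return pending.ToString();
        }
    }

    // Returns true if the chunk parses as an XML fragment with balanced tags.
    public static bool IsWellFormed(string fragment, XmlNameTable nameTable)
    {
        var settings = new XmlReaderSettings
        {
            NameTable = nameTable,
            ConformanceLevel = ConformanceLevel.Fragment
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(fragment), settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false; // unbalanced or otherwise malformed message
        }
    }

    // Copies the header and every well-formed message to 'output'.
    // Assumption: the document ends with the closing tag given in 'rootClose'.
    public static void Filter(TextReader input, TextWriter output, string rootClose = "</Root>")
    {
        var nameTable = new NameTable();
        bool first = true;
        foreach (string chunk in Split(input))
        {
            if (first && !chunk.StartsWith(Marker, StringComparison.Ordinal))
            {
                output.Write(chunk); // text before the first message, e.g. <Root><SomeElement/>
                first = false;
                continue;
            }
            first = false;

            // Separate the closing root tag from the last message before checking it.
            string body = chunk;
            string tail = "";
            string trimmed = chunk.TrimEnd();
            if (trimmed.EndsWith(rootClose, StringComparison.Ordinal))
            {
                body = trimmed.Substring(0, trimmed.Length - rootClose.Length);
                tail = chunk.Substring(body.Length);
            }
            if (IsWellFormed(body, nameTable))
            {
                output.Write(body);
            }
            output.Write(tail);
        }
    }
}
```

The pipe would presumably be wrapped as `new StreamReader(pipe, Encoding.UTF8)` and passed to `Filter` together with a `StreamWriter` for the result. Note that `Split` rescans the pending text after every read, so a very large single message would be slow in this form.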
incos
  • I would explore using an `XmlReader` with the stream from the pipe to find the processing instructions. When a document is found delimited by PIs, it can be handed off to another instance of an XML parser. Your buffer chunking of 4096 may break a character sequence such that the regular expression doesn't recognize it. Additionally, it is always error-prone and brittle to treat structured data (like XML) as if it were unstructured. Always use an appropriate parser. – Jonathan Dodds Apr 10 '23 at 12:59
  • See my answer here : https://stackoverflow.com/questions/61607180/parse-big-xml-file-using-xmlreader?force_isolation=true – jdweng Apr 10 '23 at 16:26
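
The chunking concern raised in the first comment can be illustrated with a small sketch. The class and method names here (`ChunkDecodingDemo`, `DecodeIndependently`, `DecodeStateful`) are hypothetical and only for demonstration. Decoding each byte chunk on its own with `Encoding.UTF8.GetString` can corrupt a multi-byte character that straddles a chunk boundary, whereas a stateful `Decoder` (which is also what `StreamReader` relies on) carries the incomplete bytes over to the next call.

```csharp
using System;
using System.Text;

public static class ChunkDecodingDemo
{
    // Decodes every chunk in isolation, as the code in the question does.
    // A character split across two chunks comes out as replacement characters.
    public static string DecodeIndependently(byte[][] chunks)
    {
        var result = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            result.Append(Encoding.UTF8.GetString(chunk, 0, chunk.Length));
        }
        return result.ToString();
    }

    // Uses one Decoder for all chunks, so trailing bytes of an incomplete
    // character are remembered until the rest arrives.
    public static string DecodeStateful(byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var result = new StringBuilder();
        foreach (byte[] chunk in chunks)
        {
            var chars = new char[Encoding.UTF8.GetMaxCharCount(chunk.Length)];
            int count = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            result.Append(chars, 0, count);
        }
        return result.ToString();
    }
}
```

For example, the two bytes of "é" (0xC3, 0xA9) placed in separate chunks decode correctly only with the stateful variant. The same boundary problem applies to the `<?pi` text itself, which a regular expression run on one chunk cannot match if the text is cut in half.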
