Estimating the iteration element in C# for very large XML files

Question

I am working with many different XML files, and I do not know in advance which element is the iteration element in each file.

By iteration element I mean the element that is repeated throughout the XML file (the one that appears in XSD files with maxOccurs="unbounded").

For example, an orders file might contain a repeated element called order.

Some examples of the structures I receive:

<orders>
   <order>...</order>
   <order>...</order>
</orders>

<products>
   <product>...</product>
   <product>...</product>
</products>

<root>
   <element>...</element>
   <element>...</element>
</root>

<products>
   <section>
    <someelement>content</someelement>
    <item>...</item>
    <item>...</item>
    <item>...</item>
    <item>...</item>
   </section>
</products>

In the examples above, the iterators/repeaters are:

orders > order
products > product
root > element
products > section > item

My usual way to estimate the iterator is to load the full XML file into an XmlDocument, generate an XSD schema from it, and then find the first element in the schema with maxOccurs="unbounded" that has subelements. This works fine, but XmlDocument does not cope well with very large XML files (gigabytes in size).

For these I need to use an XmlReader, but I have no idea how to estimate the iterator with an XmlReader, since I can't use the XSD trick.

So I am looking for input on how to estimate it. Any ideas are appreciated.

hmm, i was actually referring to the indenting or lack of... — jazb, Nov 01 '18 at 08:27
This is a specification/requirements problem, not a coding problem. If you can provide a precise specification of what you mean by "iteration" element, then coding to that spec will be easy. The challenge is that the concept as you have described it is a very fuzzy one, and there are many XML documents to which it does not apply. For example, in a scientific article, would you be looking for the sections or for the paragraphs? — Michael Kay, Nov 01 '18 at 10:04
The complexity level I usually face is as described above, and as I wrote, I go for the first maxOccurs="unbounded". It is a coding issue in my world. — Dennis C, Nov 01 '18 at 10:48

jdweng · Accepted Answer · 2018-11-01T13:11:55.547

Try the following code, which counts how often each element path occurs and stores the counts in a dictionary:

using System;
using System.Collections.Generic;
using System.Collections;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;


namespace ConsoleApplication75
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
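        static void Main(string[] args)
        {
            Node.ParseChildren(FILENAME);

            // Print each path with its count, most frequent first.
            // Paths with a count above 1 are the candidates for the iteration element.
            foreach (KeyValuePair<string, int> entry in Node.dict.OrderByDescending(e => e.Value))
            {
                Console.WriteLine("{0}: {1}", entry.Key, entry.Value);
            }
        }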


    }
    public class Node
    {
        public static XmlReader reader;
        public static Dictionary<string, int> dict = new Dictionary<string, int>();

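        public static void ParseChildren(string filename)
        {
            // Dispose the reader when done so the file handle is released.
            using (reader = XmlReader.Create(filename))
            {
                reader.MoveToContent();
                // Start the path at the root element so keys read like "orders > order".
                string name = reader.LocalName;
                reader.ReadStartElement();
                ParseChildrenRecursive(name);
            }
        }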

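        // Counts every child element of the current element under its full path,
        // then recurses into each child. Returns once the matching end tag is consumed.
        public static void ParseChildrenRecursive(string path)
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.EndElement)
                {
                    reader.ReadEndElement();
                    break;
                }
                // IsStartElement() skips whitespace and comments before testing the node.
                if (reader.IsStartElement())
                {
                    string childName = reader.LocalName;
                    string newPath = path + " > " + childName;
                    if(dict.ContainsKey(newPath))
                    {
                        dict[newPath] += 1;
                    }
                    else
                    {
                        dict.Add(newPath, 1);
                    }
                    // An empty element such as <item/> has no end tag, so it must not be
                    // recursed into; otherwise the parent's end tag would be consumed too early.
                    bool isEmpty = reader.IsEmptyElement;
                    reader.ReadStartElement();
                    if (!isEmpty)
                    {
                        ParseChildrenRecursive(newPath);
                    }
                }
                // Skip text and other non-element nodes; element nodes are handled above.
                // Note: the enum member for a start tag is XmlNodeType.Element.
                if ((reader.NodeType != XmlNodeType.Element) && (reader.NodeType != XmlNodeType.EndElement))
                   reader.Read();
            }
        }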
    }

}

The last statement (reader.Read();) could otherwise skip over a start or end element, so you only want to read when the current node is neither of those: if ((reader.NodeType != XmlNodeType.Element) && (reader.NodeType != XmlNodeType.EndElement)) reader.Read(); — jdweng, Nov 01 '18 at 13:07
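
The dictionary above only holds counts; it does not yet pick the iterator. As a follow-up, here is a minimal sketch (not part of the original answer) of one way to select it, assuming the question's rule of "the first repeated element that has subelements". The names IterationElementFinder, Find and maxElements are made up for illustration. It uses an explicit stack instead of recursion, and maxElements is an optional cap so that a GB-size file can be sampled rather than read to the end.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class IterationElementFinder
{
    // Per-path statistics gathered while streaming.
    private sealed class PathInfo
    {
        public int Count;         // how many times the path occurred
        public bool HasChildren;  // whether any occurrence contained child elements
    }

    // Returns the first path (in document order) that occurs more than once
    // and contains child elements, or null if there is no such path.
    public static string Find(TextReader input, int maxElements = int.MaxValue)
    {
        var order = new List<string>();                  // paths in first-seen order
        var stats = new Dictionary<string, PathInfo>();
        var open = new Stack<string>();                  // paths of the currently open elements
        int seen = 0;

        using (XmlReader reader = XmlReader.Create(input))
        {
            while (seen < maxElements && reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    seen++;
                    string path = open.Count == 0
                        ? reader.LocalName
                        : open.Peek() + " > " + reader.LocalName;

                    // The enclosing element has at least one child element.
                    if (open.Count > 0) stats[open.Peek()].HasChildren = true;

                    PathInfo info;
                    if (!stats.TryGetValue(path, out info))
                    {
                        info = new PathInfo();
                        stats.Add(path, info);
                        order.Add(path);
                    }
                    info.Count++;

                    // An empty element (<item/>) produces no EndElement node, so it is not pushed.
                    if (!reader.IsEmptyElement) open.Push(path);
                }
                else if (reader.NodeType == XmlNodeType.EndElement)
                {
                    open.Pop();
                }
            }
        }

        // A parent path is always first seen before its children, so the
        // outermost repeating element wins over repeats nested inside it.
        foreach (string path in order)
        {
            PathInfo info = stats[path];
            if (info.Count > 1 && info.HasChildren) return path;
        }
        return null;
    }
}
```

To run it against a file, pass a reader such as File.OpenText(@"c:\temp\test.xml") wrapped in a using block. Note that with a small maxElements the result is only an estimate: an element that starts repeating later in the file will be missed.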
