Parsing XML files that do not have 'root' node in Python

Question

My client wants me to parse over 100,00 XML files and convert them into a text file.

I have successfully parsed a couple of files and converted them into a text file. However, I only managed that by editing each XML file and wrapping its contents in <root></root>.

This seems inefficient, since I would have to edit nearly 100,00 XML files to achieve my desired result.

Is there any way for my Python code to recognize the first node and read it as the root node?

I have tried the method shown in "Python XML Parsing without root"; however, I do not fully understand it and I do not know where to apply it in my code.

The XML format is as follows:

<Thread>
   <ThreadID></ThreadID>
   <Title></Title>
   <InitPost>
        <UserID></UserID>
        <Date></Date>
        <icontent></icontent>
  </InitPost>
  <Post>
       <UserID></UserID>
       <Date></Date>
       <rcontent></rcontent>
  </Post>
</Thread>

And this is my code on how to parse the XML files:

import os
from xml.etree import ElementTree


saveFile = open('test3.txt','w')

for path, dirs, files in os.walk("data/sample"):
   for f in files:
    fileName = os.path.join(path, f)
    with open(fileName, "r", encoding="utf8") as myFile:
        dom = ElementTree.parse(myFile)

        thread = dom.findall('Thread')

        for t in thread:

            threadID = str(t.find('ThreadID').text)
            threadID = threadID.strip()

            title = str(t.find('Title').text)
            title = title.strip()

            userID = str(t.find('InitPost/UserID').text)
            userID = userID.strip()

            date = str(t.find('InitPost/Date').text)
            date = date.strip()

            initPost = str(t.find('InitPost/icontent').text)
            initPost = initPost.strip()

        post = dom.findall('Thread/Post')

The rest of the code is just writing to the output text file.

score 2 · Accepted Answer · answered Feb 15 '20 at 10:38


Load the XML as text and wrap it in a root element.

'1.xml' is the XML you posted.

from xml.etree import ElementTree as ET

files = ['1.xml'] # your list of files goes here
for file in files:
    with open(file) as f:
        # wrap it with <r>
        xml = '<r>' + f.read() + '</r>'
        root = ET.fromstring(xml)
        print('Now we are ready to work with the xml')


balderman


Interesting. So you made a list of all the files and then read each one into a string? Correct me if my understanding is wrong. Also, is there any reason why it did not go over the files in order? What I mean is, the files I passed in were converted in random order, not in the order they appear in the folder. – Kamarul Adha Feb 15 '20 at 15:51
The order is not the point here. The point is that you load each file as a string, add the root and parse it. – balderman Feb 15 '20 at 15:54
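
To tie the accepted answer back to the directory walk in the question, and to the ordering question in the comments, here is a minimal sketch. The helper names (parse_rootless, iter_xml_files, thread_summaries) are made up for illustration. It assumes every file is UTF-8 and contains no <?xml ...?> declaration, because a declaration that ends up after the synthetic <r> tag makes the parser raise an error. os.walk returns names in whatever order the file system reports them, so the sketch sorts them explicitly.

```python
import os
from xml.etree import ElementTree as ET


def parse_rootless(path):
    """Read one file as text, wrap it in a synthetic <r> element, and parse it."""
    with open(path, encoding="utf8") as f:
        return ET.fromstring("<r>" + f.read() + "</r>")


def iter_xml_files(top):
    """Yield the paths of all files under `top` in a stable, sorted order."""
    for path, dirs, files in os.walk(top):
        dirs.sort()  # sorting in place fixes the order os.walk descends in
        for name in sorted(files):
            yield os.path.join(path, name)


def thread_summaries(top):
    """Yield (ThreadID, Title) for every <Thread> in every file under `top`."""
    for file_name in iter_xml_files(top):
        root = parse_rootless(file_name)
        # Each <Thread> is now a direct child of the synthetic root,
        # so findall('Thread') matches it.
        for t in root.findall("Thread"):
            thread_id = t.findtext("ThreadID", default="").strip()
            title = t.findtext("Title", default="").strip()
            yield thread_id, title
```

With the layout from the question, `for thread_id, title in thread_summaries("data/sample"): ...` would replace the outer loops, and the remaining fields (InitPost/UserID and so on) can be read the same way with findtext.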

score 1 · Answer 2 · answered Feb 15 '20 at 08:35


I don't know if the Python parser supports DTDs, but if it does, then one approach is to define a simple wrapper document like this:

<!DOCTYPE root [
<!ENTITY e SYSTEM "realdata.xml">
]>
<root>&e;</root>

and point the parser at this wrapper document instead of at realdata.xml.


Michael Kay


Unfortunately, the built-in `xml.etree.ElementTree` does not support DTD processing. The drop-in replacement library `lxml` does (see [example](https://stackoverflow.com/a/55308629/18771)). – Tomalak Feb 15 '20 at 09:13

score 1 · Answer 3 · answered Feb 15 '20 at 09:20

Not sure about Python, but generally speaking you can use SGML to infer missing tags, whether at the document element (root) level or elsewhere. The basic technique is to create a DTD that declares the document element like so:

<!DOCTYPE root [
  <!ELEMENT root O O ANY>
]>
<!-- your document character data goes here -->

where the important things are the O O (letter O) tag omission indicators telling SGML that both the start- and end-element tags for root can be omitted.

See also the following questions with more details:
