1

My client wants me to parse over 100,00 XML files and convert them into a text file.

I have successfully parsed a couple of files and converted them into a text file. However, I only managed that by editing each XML file and wrapping its contents in <root></root>.

This seems inefficient, since I would have to edit nearly 100,00 XML files to achieve my desired result.

Is there any way for my Python code to recognize the first node and read it as the root node?

I have tried the method shown in "Python XML Parsing without root"; however, I do not fully understand it and do not know where to apply it.

The XML format is as follows:

<Thread>
   <ThreadID></ThreadID>
   <Title></Title>
   <InitPost>
        <UserID></UserID>
        <Date></Date>
        <icontent></icontent>
  </InitPost>
  <Post>
       <UserID></UserID>
       <Date></Date>
       <rcontent></rcontent>
  </Post>
</Thread>

And this is the code I use to parse the XML files:

import os
from xml.etree import ElementTree


saveFile = open('test3.txt','w')

for path, dirs, files in os.walk("data/sample"):
    for f in files:
        fileName = os.path.join(path, f)
        with open(fileName, "r", encoding="utf8") as myFile:
            dom = ElementTree.parse(myFile)

            thread = dom.findall('Thread')

            for t in thread:

                threadID = str(t.find('ThreadID').text)
                threadID = threadID.strip()

                title = str(t.find('Title').text)
                title = title.strip()

                userID = str(t.find('InitPost/UserID').text)
                userID = userID.strip()

                date = str(t.find('InitPost/Date').text)
                date = date.strip()

                initPost = str(t.find('InitPost/icontent').text)
                initPost = initPost.strip()

            post = dom.findall('Thread/Post')

The rest of the code is just writing to the output text file.
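One possible reading of why the loop finds nothing, assuming each file holds exactly one `<Thread>` element as in the sample above: the parser already treats `<Thread>` as the root, and `findall('Thread')` only searches the root's children, so it returns an empty list. A minimal sketch under that assumption (the sample string and all values in it are invented for illustration):

```python
from xml.etree import ElementTree

# Hypothetical sample in the format shown above; the values are made up.
sample = """<Thread>
   <ThreadID>42</ThreadID>
   <Title> Example title </Title>
   <InitPost>
        <UserID>u1</UserID>
        <Date>2020-02-15</Date>
        <icontent>first post</icontent>
   </InitPost>
   <Post>
        <UserID>u2</UserID>
        <Date>2020-02-16</Date>
        <rcontent>a reply</rcontent>
   </Post>
</Thread>"""

root = ElementTree.fromstring(sample)  # root is the <Thread> element itself

# Looking for 'Thread' below the root finds nothing ...
no_match = root.findall('Thread')

# ... whereas paths relative to the root work directly.
thread_id = root.findtext('ThreadID', default='').strip()
title = root.findtext('Title', default='').strip()
user_id = root.findtext('InitPost/UserID', default='').strip()
replies = [p.findtext('rcontent', default='').strip()
           for p in root.findall('Post')]
```

When reading from a file, `ElementTree.parse(fileName).getroot()` gives the same kind of element. If a file really contains several top-level elements, this does not apply and a wrapper element is still needed.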

Kamarul Adha

3 Answers

2

Load the XML as text and wrap it in a root element.

Here '1.xml' is the XML you posted.

from xml.etree import ElementTree as ET

files = ['1.xml'] # your list of files goes here
for file in files:
    with open(file) as f:
        # wrap it with <r>
        xml = '<r>' + f.read() + '</r>'
        root = ET.fromstring(xml)
        print('Now we are ready to work with the xml')
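Plain concatenation fails if a file starts with an XML declaration (`<?xml ...?>`), because a declaration is only allowed at the very start of a document. A hedged sketch that strips a leading declaration before wrapping; `parse_fragment` and the wrapper tag `r` are illustrative names, not part of any library:

```python
import re
from xml.etree import ElementTree as ET

# Matches optional whitespace followed by an XML declaration at the start.
_XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>')

def parse_fragment(text, wrapper='r'):
    """Parse XML text that may have several top-level elements.

    Returns a synthetic wrapper element whose children are the
    original top-level elements.
    """
    # Drop a byte-order mark (if any), then the declaration (if any).
    body = _XML_DECL.sub('', text.lstrip('\ufeff'), count=1)
    return ET.fromstring('<{0}>{1}</{0}>'.format(wrapper, body))

# Two top-level <Thread> elements preceded by a declaration.
root = parse_fragment(
    '<?xml version="1.0"?>\n'
    '<Thread><Title>a</Title></Thread>'
    '<Thread><Title>b</Title></Thread>'
)
titles = [t.findtext('Title') for t in root.findall('Thread')]
```

After wrapping, each original top-level element is a child of the wrapper, so `root.findall('Thread')` finds them.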
balderman
  • Interesting. So you made a list of all the files and then turned each one into a string? Correct me if my understanding is wrong. Also, is there any reason why it did not go over the files in order? The files I supplied were converted in a seemingly random order, not the order they have in the folder. – Kamarul Adha Feb 15 '20 at 15:51
  • The order is not the point here. The point is that you load each file as a string, add the root element, and parse it. – balderman Feb 15 '20 at 15:54
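On the ordering question raised in the comments: directory listings come back in arbitrary order, so sorting the paths explicitly is one way to get a repeatable sequence. A small sketch, assuming the files end in `.xml`; the `xml_files_in_order` helper is a made-up name:

```python
from pathlib import Path

def xml_files_in_order(folder):
    """Return all .xml files below folder, sorted by path."""
    return sorted(p for p in Path(folder).rglob('*.xml') if p.is_file())
```

Iterating over `xml_files_in_order('data/sample')` would then visit the files in the same order on every run. The sort is lexicographic, so `10.xml` comes before `2.xml` unless the names are zero-padded or a numeric sort key is supplied.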
1

I don't know whether the Python parser supports DTDs, but if it does, one approach is to define a simple wrapper document like this:

<!DOCTYPE root [
<!ENTITY e SYSTEM "realdata.xml">
]>
<root>&e;</root>

and point the parser at this wrapper document instead of at realdata.xml.

Michael Kay
  • Unfortunately, the built-in `xml.etree.ElementTree` does not support DTD processing. The drop-in replacement library `lxml` does (see [example](https://stackoverflow.com/a/55308629/18771)). – Tomalak Feb 15 '20 at 09:13
1

Not sure about Python, but generally speaking you can use SGML to infer missing tags, whether at the document element (root) level or elsewhere. The basic technique is to create a DTD that declares the document element, like so:

<!DOCTYPE root [
  <!ELEMENT root O O ANY>
]>
<!-- your document character data goes here -->

where the important things are the O O (letter O) tag omission indicators telling SGML that both the start- and end-element tags for root can be omitted.

See also the following questions with more details:

imhotap