My client wants me to parse over 100,000 XML files and convert them into a text file.
I have successfully parsed a couple of files and converted them into a text file. However, I only managed to do that by editing each XML file and wrapping its contents in <root></root>.
This seems inefficient, since I would have to edit nearly 100,000 XML files to achieve my desired result.
Is there any way for my Python code to recognize the first node and read it as the root node?
I have tried using the method shown in "Python XML Parsing without root"; however, I do not fully understand it, and I do not know where to apply it in my code.
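For context, the manual edit described above can also be done in memory, so the files on disk stay untouched. The following is only a minimal sketch of that idea, not necessarily the method from the linked question: the helper name parse_with_wrapper is hypothetical, and it assumes the file text does not start with an <?xml ...?> declaration (a declaration is only valid at the very beginning of a document, so wrapping would break it).

```python
from xml.etree import ElementTree


def parse_with_wrapper(xml_text):
    """Parse XML text by wrapping it in a synthetic <root> element.

    Hypothetical helper. Assumes xml_text has no leading XML declaration.
    """
    return ElementTree.fromstring("<root>" + xml_text + "</root>")


# Two top-level <Thread> elements, which would not parse without a wrapper.
fragment = (
    "<Thread><ThreadID>1</ThreadID></Thread>"
    "<Thread><ThreadID>2</ThreadID></Thread>"
)
root = parse_with_wrapper(fragment)
thread_ids = [t.find("ThreadID").text for t in root.findall("Thread")]
```

In a loop over files, the text passed in would be the result of myFile.read(), and the returned element supports the same findall('Thread') lookups as a hand-edited file.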
The XML format is as follows:
<Thread>
    <ThreadID></ThreadID>
    <Title></Title>
    <InitPost>
        <UserID></UserID>
        <Date></Date>
        <icontent></icontent>
    </InitPost>
    <Post>
        <UserID></UserID>
        <Date></Date>
        <rcontent></rcontent>
    </Post>
</Thread>
And this is the code I use to parse the XML files:
import os
from xml.etree import ElementTree

saveFile = open('test3.txt', 'w')

for path, dirs, files in os.walk("data/sample"):
    for f in files:
        fileName = os.path.join(path, f)
        with open(fileName, "r", encoding="utf8") as myFile:
            dom = ElementTree.parse(myFile)
            thread = dom.findall('Thread')
            for t in thread:
                threadID = str(t.find('ThreadID').text)
                threadID = threadID.strip()
                title = str(t.find('Title').text)
                title = title.strip()
                userID = str(t.find('InitPost/UserID').text)
                userID = userID.strip()
                date = str(t.find('InitPost/Date').text)
                date = date.strip()
                initPost = str(t.find('InitPost/icontent').text)
                initPost = initPost.strip()
            post = dom.findall('Thread/Post')
The rest of the code is just writing to the output text file.
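A possible direction, offered as a sketch rather than a confirmed fix: if each file really has a single top-level <Thread> element, as in the sample format above, then ElementTree.parse() already treats that element as the root, and getroot() returns it. Lookups can then be made relative to the root instead of searching for a 'Thread' child, which would make the <root> wrapper unnecessary. The helper names parse_thread and _text and the dictionary keys below are hypothetical, chosen only for illustration.

```python
import io
from xml.etree import ElementTree


def _text(element, path):
    """Return the stripped text found at path, or '' if missing or empty."""
    node = element.find(path)
    if node is None or node.text is None:
        return ""
    return node.text.strip()


def parse_thread(source):
    """Parse one document whose top-level element is <Thread>.

    source is a filename or file object, as accepted by ElementTree.parse.
    """
    root = ElementTree.parse(source).getroot()  # the <Thread> element itself
    if root.tag != "Thread":
        raise ValueError("unexpected root element: " + root.tag)
    return {
        "threadID": _text(root, "ThreadID"),
        "title": _text(root, "Title"),
        "userID": _text(root, "InitPost/UserID"),
        "date": _text(root, "InitPost/Date"),
        "initPost": _text(root, "InitPost/icontent"),
        # Paths are relative to <Thread>, so 'Post' rather than 'Thread/Post'.
        "posts": [
            {
                "userID": _text(p, "UserID"),
                "date": _text(p, "Date"),
                "content": _text(p, "rcontent"),
            }
            for p in root.findall("Post")
        ],
    }


# Made-up sample data in the format shown above.
sample = """<Thread>
<ThreadID> 42 </ThreadID>
<Title>Hello</Title>
<InitPost><UserID>u1</UserID><Date>2020-01-01</Date><icontent>first</icontent></InitPost>
<Post><UserID>u2</UserID><Date>2020-01-02</Date><rcontent>reply one</rcontent></Post>
<Post><UserID>u3</UserID><Date>2020-01-03</Date><rcontent>reply two</rcontent></Post>
</Thread>"""

result = parse_thread(io.StringIO(sample))
```

Inside the os.walk loop, parse_thread(fileName) would take the place of the dom.findall('Thread') step. This only applies when every file is well-formed with exactly one top-level element; files containing several top-level elements would still need a wrapper.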