I have multiple large XML files that I need to import and iterate through; they all share the same tree structure. What I would like to do is pass in a list of Ids that I know are wrong and remove the corresponding reports from the XML file. One report is everything between an opening and a closing "T" tag. The structure is something like this (the real files contain more than just the Id, so there are more child elements under Start):
<Header>
<Header2>
<Header3>
<T>
<Start>
<Id>abcd</Id>
</Start>
</T>
<T>
<Start>
<Id>qrlf</Id>
</Start>
</T>
</Header3>
</Header2>
</Header>
What I have so far:
from xml.etree import cElementTree as ET
file_path = '/path/to/my_xml.xml'
to_remove = []
root = None
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'end':
        if elem.tag == 'Id':
            new_root = elem
            #print([elem.tag for elem in new_root.iter()])
            for elem2 in new_root.iter('Id'):
                id = elem2.text
                if id == 'abcd':
                    print(id)
                    to_remove.append(new_root)
        root = elem
for item in to_remove:
    root.remove(item)
The code above obviously doesn't work: root ends up being the top-level Header element of the whole XML file, so root.remove() can't find the subelement I want to remove, because its parent is Header3, not Header.
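To make the failure concrete, here is a minimal sketch of what I believe is happening, using a trimmed copy of the sample XML. The variable names (xml_text, target, removed_from_root) are just for illustration. As far as I can tell, Element.remove() only looks at direct children and raises ValueError otherwise, so the call only succeeds on the element's actual parent:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample structure (illustrative only).
xml_text = (
    "<Header><Header2><Header3>"
    "<T><Start><Id>abcd</Id></Start></T>"
    "</Header3></Header2></Header>"
)
root = ET.fromstring(xml_text)
target = root.find(".//T")  # the report to remove

try:
    root.remove(target)  # T is not a direct child of Header
    removed_from_root = True
except ValueError:
    removed_from_root = False

# Removing through the direct parent (Header3) works.
parent = root.find(".//Header3")
parent.remove(target)
remaining = len(root.findall(".//T"))
```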
So the desired output would be:
<Header>
<Header2>
<Header3>
<T>
<Start>
<Id>qrlf</Id>
</Start>
</T>
</Header3>
</Header2>
</Header>
Going forward, I won't be removing a single value but thousands of them, so the input will be a list; I just thought it was easier to present the problem this way. Any help is appreciated.
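For reference, this is roughly the behaviour I am after, written as a sketch rather than a tested solution. It assumes the whole tree fits in memory, that every report is a T element with the Id at Start/Id, and the function name remove_reports and the variable bad_ids are made up for illustration. The Ids go into a set so that each lookup stays cheap with thousands of values:

```python
import xml.etree.ElementTree as ET

def remove_reports(root, bad_ids):
    """Remove every <T> whose Start/Id text is in bad_ids; return how many were removed."""
    bad_ids = set(bad_ids)  # set membership is fast for thousands of Ids
    removed = 0
    # Snapshot the elements first so the tree is not modified while iter() walks it.
    for parent in list(root.iter()):
        # findall() returns a list of direct <T> children, so removing inside the loop is safe.
        for report in parent.findall("T"):
            if report.findtext("Start/Id") in bad_ids:
                parent.remove(report)  # remove via the direct parent
                removed += 1
    return removed

sample = """<Header><Header2><Header3>
<T><Start><Id>abcd</Id></Start></T>
<T><Start><Id>qrlf</Id></Start></T>
</Header3></Header2></Header>"""

root = ET.fromstring(sample)
removed = remove_reports(root, ["abcd"])
remaining_ids = [e.text for e in root.iter("Id")]
```

For a real file, I assume the same function could be applied to ET.parse(file_path).getroot() and the result saved with the tree's write() method.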