How can I remove XML parts with iterparse with parents included using ElementTree in Python?

Question

I have several large XML files, all with the same tree structure, that I need to import and iterate through. What I would like to do is supply a list of Ids that I know are wrong and remove each matching report from the XML file. One report is everything between a pair of "T" tags. The structure is something like this, except that Start has more child elements than just the Id:

<Header>
        <Header2>
           <Header3>
           <T>
              <Start> 
                <Id>abcd</Id>
              </Start>
           </T>
           <T>
              <Start> 
                <Id>qrlf</Id>
              </Start>
           </T>
           </Header3>
        </Header2>
</Header>

What I have so far:

from xml.etree import cElementTree as ET

file_path = '/path/to/my_xml.xml'
to_remove = []
root = None
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'end':
        if elem.tag == 'Id':
            new_root = elem
            #print([elem.tag for elem in new_root.iter()])
            for elem2 in new_root.iter('Id'):
                id = elem2.text
                if id == 'abcd':
                    print(id)
                    to_remove.append(new_root)
root = elem
for item in to_remove:
    root.remove(item)

The code above doesn't work: root is the whole XML document starting at Header, so root.remove() can't find the subelement I am trying to remove, because its parent is Header3, not Header.

So the desired output would be:

<Header>
        <Header2>
           <Header3>
           <T>
              <Start> 
                <Id>qrlf</Id>
              </Start>
           </T>
           </Header3>
        </Header2>
</Header>

In practice I won't be removing a single value but thousands of them, so the input will be a list; I just thought a single value was an easier way to present the problem. Any help is appreciated.

I also had an earlier version that loaded the whole file, located each Id with findall, and removed it. That worked for small files but became terribly slow on large ones, which is why I switched to iterparse. — Anna Semjén, Aug 29 '19 at 12:48

score 1 · Answer 1 · answered Aug 29 '19 at 15:08

Since your XML structure is simple, it's probably easier to use XPath (about a third of the way down https://docs.python.org/3/library/xml.etree.elementtree.html). The following usage examples are taken from that section of the documentation page:

import xml.etree.ElementTree as ET

root = ET.fromstring(countrydata)

# Top-level elements
root.findall(".")

# All 'neighbor' grand-children of 'country' children of the top-level
# elements
root.findall("./country/neighbor")

# Nodes with name='Singapore' that have a 'year' child
root.findall(".//year/..[@name='Singapore']")

# 'year' nodes that are children of nodes with name='Singapore'
root.findall(".//*[@name='Singapore']/year")

# All 'neighbor' nodes that are the second child of their parent
root.findall(".//neighbor[2]")

The XML structure used for the examples can be found at the top of the doc page.

The second example shows an easy way to select the subelements you want to remove ("T" in your case), although the second-to-last example may be more useful for you. Also see the [tag='text'] predicate in the XPath syntax section, which appears just below the examples.
Pass the results of that selection to the remove operation (~3/4 down the page), then use the ElementTree write operation (~4/5ths down the page) to get the cleaned-up XML.

The above assumes you are passing a string; to read from a file, use parse instead, e.g.:

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
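Putting the select, remove, and write steps together for the structure in the question might look like the sketch below. It is untested on large files and rests on a few assumptions: the helper name remove_reports and its parameters are made up for illustration, the sample document is the one from the question, and an in-memory BytesIO stands in for the real file. ElementTree elements have no parent pointers, so the sketch first selects the parents of all T elements with ".//T/.." and then checks each T child against a set of Ids, which avoids running one XPath query per Id.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from the question; a real run would parse a file path.
SAMPLE = b"""<Header><Header2><Header3>
<T><Start><Id>abcd</Id></Start></T>
<T><Start><Id>qrlf</Id></Start></T>
</Header3></Header2></Header>"""


def remove_reports(tree, bad_ids, record_tag="T", id_path="Start/Id"):
    """Remove every record_tag element whose Id text is in bad_ids.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    bad_ids = set(bad_ids)  # set membership stays fast with thousands of Ids
    root = tree.getroot()
    # ".//T/.." selects each element that has at least one T child.
    for parent in root.findall(f".//{record_tag}/.."):
        # findall() returns a list, so removing while looping is safe.
        for record in parent.findall(record_tag):
            if record.findtext(id_path) in bad_ids:
                parent.remove(record)
    return tree


tree = ET.parse(io.BytesIO(SAMPLE))
remove_reports(tree, ["abcd"])

# Write the cleaned document; a file path works in place of the buffer.
out = io.BytesIO()
tree.write(out, encoding="utf-8", xml_declaration=True)
```

This still loads the whole document into memory, so it addresses the per-Id query cost rather than the memory footprint.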

*** DISCLAIMER *** I'm doing similar work, but I haven't actually tried this. So think of it as inspiration, not as a complete solution.

BTW, I'm using Python 3.7.4. For those who don't already know, you can use the version selector at the top left of the doc page to select the version you are using.

Thank you for this. My first code looked something like this (I should have attached it): `for item in rem_list:` `for elem in root.findall("./T/Start/[Id='{}']...".format(item)):` ` # new` `root.remove(elem)` `ET.tostring(root, encoding='utf8').decode('utf8')` This performed quite badly, which is why I chose the streaming option, but I will look into this. Thanks — Anna Semjén, Aug 30 '19 at 07:30

Martin Honnen · Accepted Answer · 2019-08-29T18:10:04.583

I think you can use

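import xml.etree.ElementTree as ET

ids_to_remove = ['abcd']

elements_to_remove = []

# With no events argument, iterparse reports only 'end' events.
for event, element in ET.iterparse('file.xml'):
    # A complete T report whose Id is on the removal list
    if element.tag == 'T' and element.find('Start/Id').text in ids_to_remove:
        elements_to_remove.append(element)
    # Header3 is complete: detach the collected reports from it
    if element.tag == 'Header3':
        for el in elements_to_remove:
            element.remove(el)
            el.clear()
    # The outermost element is the last one to end
    if element.tag == 'Header':
        root = element

ET.dump(root)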

I haven't tested how that works with huge files. It collects all the elements to be removed first and only removes them at the end. I am not sure whether the ElementTree API offers a way to detach an element directly in the if element.tag == 'T' and element.find('Start/Id').text in ids_to_remove: branch; perhaps the following frees the elements earlier:

import xml.etree.ElementTree as ET

ids_to_remove = ['abcd', 'baz', 'bar']

for event, element in ET.iterparse('file.xml', events=['start', 'end']):
    # Report complete: detach it from its parent right away
    if event == 'end' and element.tag == 'T' and element.find('Start/Id').text in ids_to_remove:
        header3.remove(element)
        element.clear()
    # Remember the enclosing Header3 as soon as it opens
    if event == 'start' and element.tag == 'Header3':
        header3 = element
    if element.tag == 'Header':
        root = element

ET.dump(root)
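For thousands of Ids, a variation on the same streaming idea might look like the sketch below. It is a sketch under assumptions rather than a tested solution for huge files: the function name strip_reports and its parameters are hypothetical, a BytesIO buffer stands in for the real file, and the Ids are held in a set so each lookup stays cheap. Instead of hard-coding Header3, it keeps a stack of the currently open elements, so the parent of a T is known when its 'end' event arrives. It uses findtext, which returns None rather than raising when a T has no Start/Id.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from the question; a real run would pass a file path.
SAMPLE = b"""<Header><Header2><Header3>
<T><Start><Id>abcd</Id></Start></T>
<T><Start><Id>qrlf</Id></Start></T>
</Header3></Header2></Header>"""


def strip_reports(source, bad_ids, record_tag="T", id_path="Start/Id"):
    """Stream-parse source and drop records whose Id is in bad_ids.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    bad_ids = set(bad_ids)
    stack = []   # elements that have started but not yet ended
    root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = element  # first element to start is the document root
            stack.append(element)
            continue
        stack.pop()  # 'end' event: the element on top is the one that just closed
        if (element.tag == record_tag and stack
                and element.findtext(id_path) in bad_ids):
            stack[-1].remove(element)  # stack[-1] is now the direct parent
    return root


root = strip_reports(io.BytesIO(SAMPLE), ["abcd"])
remaining = [e.text for e in root.iter("Id")]
```

The reports that are kept still accumulate in memory, so this reduces the work per removal rather than the size of the final tree. The result can be saved with ET.ElementTree(root).write(...).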

Thanks for this, I think I now understand the streaming concept. I have a few namespaces and forgot to mention that a T tag also appears inside the main "T" block, so I had to make sure I pick the main one and amend the script with a few additions. For some strange reason the first one seems to run a bit faster, but the difference is not very big. Again, thank you very much for this — Anna Semjén, Aug 30 '19 at 11:57
@AnnaSemjén, as for performance: in a simple test with a larger file, using `lxml` (e.g. `from lxml import etree as ET`) instead of `cElementTree` greatly reduced the processing time, so perhaps that also speeds things up for you. — Martin Honnen, Aug 30 '19 at 12:51
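On the namespaces mentioned in the comments: ElementTree reports namespaced tags in the form {uri}localname, so the tag comparisons and the find path need the qualified names. The sketch below shows one way this could look. The namespace URI, the sample document, and the helper names ns_tag and remove_namespaced are all hypothetical, since the real namespaces are not shown in the thread, and it assumes every matching T is a direct child of Header3.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://example.com/reports"  # hypothetical namespace URI

SAMPLE_NS = (
    f'<Header xmlns="{NS}"><Header2><Header3>'
    "<T><Start><Id>abcd</Id></Start></T>"
    "<T><Start><Id>qrlf</Id></Start></T>"
    "</Header3></Header2></Header>"
).encode("utf-8")


def ns_tag(local_name, uri=NS):
    """Build the '{uri}local' form that ElementTree uses for tags."""
    return f"{{{uri}}}{local_name}"


def remove_namespaced(source, bad_ids):
    """Hypothetical helper: streaming removal with namespace-qualified tags."""
    bad_ids = set(bad_ids)
    t_tag, h3_tag = ns_tag("T"), ns_tag("Header3")
    id_path = f"{ns_tag('Start')}/{ns_tag('Id')}"
    header3 = root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = element  # first 'start' event is the document root
        if event == "start" and element.tag == h3_tag:
            header3 = element
        elif (event == "end" and element.tag == t_tag
              and header3 is not None
              and element.findtext(id_path) in bad_ids):
            # Assumes the matching T is a direct child of Header3.
            header3.remove(element)
    return root


ns_root = remove_namespaced(io.BytesIO(SAMPLE_NS), ["abcd"])
ns_remaining = [e.text for e in ns_root.iter(ns_tag("Id"))]
```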
