I have a large .xml file with many items enclosed in a tag, like MATCH. There are also other tags at the same level as MATCH, like STATISTIC, so the xml file looks like:

<?xml version="1.0"?>
<MAIN>
<SOME_TAG info="some_tag" >
<TAG value="0" />
</SOME_TAG>
<MATCH match="match_1" >
<ITEM item="1" />
</MATCH>
<MATCH match="match_2" >
<ITEM item="2" />
</MATCH>
<MATCH match="match_3" >
<ITEM item="3" />
</MATCH>
<STATISTIC stat="stat_1" >
<VALUE value="1" />
</STATISTIC>
<STATISTIC stat="stat_2" >
<VALUE value="2" />
</STATISTIC>
<STATISTIC stat="stat_3" >
<VALUE value="3" />
</STATISTIC>
<ANOTHER_TAG info="another_tag" >
<TAG value="0" />
</ANOTHER_TAG>
</MAIN>

I want to read all the contents enclosed in the MATCH and STATISTIC tags. My code is something like:

import lxml.etree as etree

def read_xml(xml_file):
    context = etree.iterparse(xml_file, events=["end"], recover=True)
    my_intests = []
    for event, element in context:
        if element.tag == "MATCH":
            for match_elem in element.findall("ITEM"):
                my_intests.append(match_elem.get("item"))
            element.clear()
            for ancestor in element.xpath("ancestor-or-self::*"):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        elif element.tag == "STATISTIC":
            for stat_elem in element.findall("VALUE"):
                my_intests.append(stat_elem.get("value"))
            element.clear()
            for ancestor in element.xpath("ancestor-or-self::*"):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        elif element.tag == "ANOTHER_TAG":
            element.clear()
            break
    print(my_intests) # print: ['1', '2', '3', '1', '2', '3']

After the STATISTIC elements have been read, the remaining elements are not cleared, so parsing becomes very slow if many tags follow ANOTHER_TAG.

One possible solution is to first call context = etree.iterparse(xml_file, events=["end"], recover=True, tag="MATCH") to read all MATCH tags, then call context = etree.iterparse(xml_file, events=["end"], recover=True, tag="STATISTIC") to read all STATISTIC tags. But this goes through the file twice, or more often if I want to read other tags as well.

I can also check if element.tag=="ANOTHER_TAG" and break to stop reading, as I have done here. But I may be interested in ANOTHER_TAG later as well, so it is not ideal to have to find out which tag follows all the tags I'm interested in and break on that. The tags in the xml file may also change in the future.

So I think a better solution would be a check like if all(tag in {"MATCH", "STATISTIC"}) have been processed, instead of if element.tag=="ANOTHER_TAG", and then break to finish reading. Is there any way to do this?

Elkan

1 Answer

I believe you are over-complicating this a bit. Try it this way:

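import lxml.etree as etree
# Parse the whole document into a tree (this assumes the file fits in memory);
# xml_file is the same path that is passed to read_xml in the question.
context = etree.parse(xml_file)
vals = context.xpath('//STATISTIC/VALUE/@value')
items = context.xpath('//MATCH/ITEM/@item')
my_intests = vals + items
print(my_intests)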

Output:

['1', '2', '3', '1', '2', '3']
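
If the file is too large to load at once, a single streaming pass with an early stop may be closer to what the question asks for. The following is only a sketch under stated assumptions: it uses the standard-library xml.etree.ElementTree rather than lxml (a similar start/end event loop should carry over to lxml.etree.iterparse), the function name read_wanted and the WANTED mapping are hypothetical names made up for illustration, and it assumes that all elements with the same tag sit next to each other under the root. Under that assumption, once every wanted tag has been seen and a different top-level tag closes, nothing of interest can follow, so the loop breaks.

```python
import io
import xml.etree.ElementTree as ET

def read_wanted(xml_source, wanted):
    """Collect attribute values from the wanted top-level tags in one pass.

    `wanted` maps a top-level tag to a (child tag, attribute) pair.
    Assumption: elements with the same tag are contiguous under the root.
    """
    results = []
    seen = set()   # wanted tags encountered so far
    depth = 0      # number of currently open elements
    root = None
    for event, element in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = element
            depth += 1
            continue
        depth -= 1
        if depth != 1:
            continue  # only react when a direct child of the root closes
        if element.tag in wanted:
            child_tag, attr = wanted[element.tag]
            results.extend(c.get(attr) for c in element.findall(child_tag))
            seen.add(element.tag)
        elif seen == set(wanted):
            break  # every wanted tag was seen and a different tag closed
        root.clear()  # drop processed children so memory stays flat
    return results

SAMPLE = b"""<?xml version="1.0"?>
<MAIN>
<SOME_TAG info="some_tag" ><TAG value="0" /></SOME_TAG>
<MATCH match="match_1" ><ITEM item="1" /></MATCH>
<MATCH match="match_2" ><ITEM item="2" /></MATCH>
<MATCH match="match_3" ><ITEM item="3" /></MATCH>
<STATISTIC stat="stat_1" ><VALUE value="1" /></STATISTIC>
<STATISTIC stat="stat_2" ><VALUE value="2" /></STATISTIC>
<STATISTIC stat="stat_3" ><VALUE value="3" /></STATISTIC>
<ANOTHER_TAG info="another_tag" ><TAG value="0" /></ANOTHER_TAG>
</MAIN>
"""

WANTED = {"MATCH": ("ITEM", "item"), "STATISTIC": ("VALUE", "value")}
result = read_wanted(io.BytesIO(SAMPLE), WANTED)
print(result)  # ['1', '2', '3', '1', '2', '3']
```

Adding another tag of interest then only means adding an entry to the mapping, with no need to know which tag happens to come next in the file. If wanted tags can reappear later in the document, the contiguity assumption fails and the early break would skip them.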
Jack Fleeting