I have a large .xml file with many items enclosed in a tag such as MATCH. There are also other tags at the same level as MATCH, such as STATISTIC, so the XML file looks like this:
<?xml version="1.0"?>
<MAIN>
    <SOME_TAG info="some_tag" >
        <TAG value="0" />
    </SOME_TAG>
    <MATCH match="match_1" >
        <ITEM item="1" />
    </MATCH>
    <MATCH match="match_2" >
        <ITEM item="2" />
    </MATCH>
    <MATCH match="match_3" >
        <ITEM item="3" />
    </MATCH>
    <STATISTIC stat="stat_1" >
        <VALUE value="1" />
    </STATISTIC>
    <STATISTIC stat="stat_2" >
        <VALUE value="2" />
    </STATISTIC>
    <STATISTIC stat="stat_3" >
        <VALUE value="3" />
    </STATISTIC>
    <ANOTHER_TAG info="another_tag" >
        <TAG value="0" />
    </ANOTHER_TAG>
</MAIN>
I want to read all the contents enclosed in the MATCH and STATISTIC tags. My code looks something like this:
import lxml.etree as etree

def read_xml(xml_file):
    context = etree.iterparse(xml_file, events=["end"], recover=True)
    my_intests = []
    for event, element in context:
        if element.tag == "MATCH":
            for match_elem in element.findall("ITEM"):
                my_intests.append(match_elem.get("item"))
            # Free the processed element and everything that precedes it.
            element.clear()
            for ancestor in element.xpath("ancestor-or-self::*"):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        elif element.tag == "STATISTIC":
            for stat_elem in element.findall("VALUE"):
                my_intests.append(stat_elem.get("value"))
            element.clear()
            for ancestor in element.xpath("ancestor-or-self::*"):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        elif element.tag == "ANOTHER_TAG":
            # Stop parsing at the first tag after the ones of interest.
            element.clear()
            break
    print(my_intests)  # prints: ['1', '2', '3', '1', '2', '3']
Once all the STATISTIC tags have been read, the remaining elements are not cleared, so parsing will be very slow if there are many tags after ANOTHER_TAG.
One possible solution is to first call context = etree.iterparse(xml_file, events=["end"], recover=True, tag="MATCH") to read all the MATCH tags, then call context = etree.iterparse(xml_file, events=["end"], recover=True, tag="STATISTIC") to read all the STATISTIC tags. But this means going through the file twice, or more often if I want to read other tags as well.
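For comparison, here is a minimal sketch of a single-pass variant. It assumes lxml 3.0 or later, where the tag argument of iterparse also accepts a sequence of tag names. The names WANTED, read_wanted_tags and SAMPLE are hypothetical, and the sample is a shortened version of the file above. Note that this alone does not stop early; it still runs through to the end of the file.

```python
import io
import lxml.etree as etree

# Hypothetical mapping: wanted tag -> (child tag, attribute to collect).
WANTED = {"MATCH": ("ITEM", "item"), "STATISTIC": ("VALUE", "value")}

def read_wanted_tags(source):
    """Collect the attribute values of all wanted tags in one pass."""
    results = []
    # Assumption: lxml >= 3.0, where `tag` may be a sequence of names.
    # Only the wanted elements are yielded, but the whole tree is still built.
    context = etree.iterparse(
        source, events=("end",), tag=tuple(WANTED), recover=True
    )
    for _event, element in context:
        child_tag, attr = WANTED[element.tag]
        results.extend(c.get(attr) for c in element.findall(child_tag))
        # Free the element and any earlier siblings (including unwanted tags).
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
    return results

# Shortened stand-in for the file shown above.
SAMPLE = b"""<MAIN>
<SOME_TAG info="some_tag"><TAG value="0" /></SOME_TAG>
<MATCH match="match_1"><ITEM item="1" /></MATCH>
<MATCH match="match_2"><ITEM item="2" /></MATCH>
<STATISTIC stat="stat_1"><VALUE value="1" /></STATISTIC>
<ANOTHER_TAG info="another_tag"><TAG value="0" /></ANOTHER_TAG>
</MAIN>"""

print(read_wanted_tags(io.BytesIO(SAMPLE)))
```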
I can also use if element.tag == "ANOTHER_TAG" and break to stop reading, as I have done here. But I may be interested in ANOTHER_TAG later too, so it is not ideal to have to find out which tag follows all the tags I'm interested in and then break on it. The tags in the XML file may also change in the future.
So I think a better solution would be a check like if all(tag in {"MATCH", "STATISTIC"}) have been processed, instead of if element.tag == "ANOTHER_TAG", followed by break to finish reading. Is there any way to do this?
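To make the idea concrete, here is a rough sketch of such a check. It rests on an assumption that may not hold for every file: each wanted tag appears at least once as a direct child of the root, and no wanted tag reappears after an unrelated tag has followed them. Under that assumption, the loop can stop at the first unrelated top-level tag that opens after every wanted tag has been seen. WANTED, read_until_done and SAMPLE are hypothetical names, and the trailing MATCH in the sample is only there to show that nothing after the stop point is read.

```python
import io
import lxml.etree as etree

# Hypothetical mapping: wanted tag -> (child tag, attribute to collect).
WANTED = {"MATCH": ("ITEM", "item"), "STATISTIC": ("VALUE", "value")}

def read_until_done(source):
    """Stop parsing once every wanted tag has been seen and another tag starts."""
    results, seen, depth = [], set(), 0
    context = etree.iterparse(source, events=("start", "end"), recover=True)
    for event, element in context:
        if event == "start":
            depth += 1
            # depth 2 means a direct child of the root element is opening.
            if depth == 2 and element.tag not in WANTED and seen == set(WANTED):
                break  # all wanted tags processed, an unrelated tag begins
            continue
        depth -= 1
        if depth == 1:  # a direct child of the root has just closed
            if element.tag in WANTED:
                child_tag, attr = WANTED[element.tag]
                results.extend(c.get(attr) for c in element.findall(child_tag))
                seen.add(element.tag)
            # Free the element and its earlier siblings either way.
            element.clear()
            while element.getprevious() is not None:
                del element.getparent()[0]
    return results

# Shortened stand-in for the file above, plus a late MATCH that is skipped.
SAMPLE = b"""<MAIN>
<SOME_TAG info="some_tag"><TAG value="0" /></SOME_TAG>
<MATCH match="match_1"><ITEM item="1" /></MATCH>
<MATCH match="match_2"><ITEM item="2" /></MATCH>
<STATISTIC stat="stat_1"><VALUE value="1" /></STATISTIC>
<ANOTHER_TAG info="another_tag"><TAG value="0" /></ANOTHER_TAG>
<MATCH match="late"><ITEM item="9" /></MATCH>
</MAIN>"""

print(read_until_done(io.BytesIO(SAMPLE)))
```

The check never names ANOTHER_TAG, so it keeps working if the tag after the wanted ones changes. If a wanted tag can be missing from the file, the loop simply runs to the end.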