I am trying to use ElementTree's iterparse function to filter nodes based on their text and write them to a new file. I am using iterparse because the input file is large (100+ MB).
input.xml
<xmllist>
<page id="1">
<title>movie title 1</title>
<text>this is a movie in theatres</text>
</page>
<page id="2">
<title>movie title 2</title>
<text>this is a horror film</text>
</page>
<page id="3">
<title></title>
<text>actor in film</text>
</page>
<page id="4">
<title>some other topic</title>
<text>nothing related</text>
</page>
</xmllist>
Expected output (all pages whose text contains "movie" or "film")
<xmllist>
<page id="1">
<title>movie title 1</title>
<text>this is a movie in theatres</text>
</page>
<page id="2">
<title>movie title 2</title>
<text>this is a horror film</text>
</page>
<page id="3">
<title></title>
<text>actor in film</text>
</page>
</xmllist>
Current code
import xml.etree.cElementTree as etree
from xml.etree.cElementTree import dump

output_file = open('/tmp/outfile.xml', 'w')
for event, elem in iter(etree.iterparse("/tmp/test.xml", events=('start', 'end'))):
    if event == "end" and elem.tag == "page":  # need to add condition to search for strings
        output_file.write(elem)
        elem.clear()
How do I add a regular expression to filter pages based on the content of each page's text element?
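For reference, here is a minimal sketch of the kind of thing I am after, assuming Python 3 and the standard xml.etree.ElementTree module. The names filter_pages and KEYWORDS are hypothetical, the pattern is only a guess at what "movie or film" should mean, and the xmllist wrapper is written by hand rather than copied from the input. It serialises each matching page with tostring (writing the Element object itself to the file does not work) and clears the root after every page so memory stays bounded.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: matches "movie" or "film" anywhere in the text.
KEYWORDS = re.compile(r"movie|film")


def filter_pages(src, dst, pattern=KEYWORDS):
    """Stream <page> elements from src and copy the matching ones to dst.

    Returns the number of pages written.
    """
    kept = 0
    with open(dst, "w", encoding="utf-8") as out:
        # Assumption: the root element is always <xmllist> with no attributes.
        out.write("<xmllist>\n")
        context = iter(ET.iterparse(src, events=("start", "end")))
        _, root = next(context)  # the first event is the start of the root
        for event, elem in context:
            if event == "end" and elem.tag == "page":
                # Text of the <text> child, or "" if the child is missing/empty.
                text = elem.findtext("text", default="")
                if pattern.search(text):
                    # The tail may not be parsed yet at the "end" event,
                    # so set it explicitly to get one page per block.
                    elem.tail = "\n"
                    out.write(ET.tostring(elem, encoding="unicode"))
                    kept += 1
                # Drop already-processed pages from the root to free memory.
                root.clear()
        out.write("</xmllist>\n")
    return kept
```

With the paths from my code above, this would be called as filter_pages("/tmp/test.xml", "/tmp/outfile.xml"). For a case-insensitive match, the pattern could be compiled with re.IGNORECASE instead.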