
I am trying to use ElementTree's iterparse function to filter nodes based on their text and write them to a new file. I am using iterparse because the input file is large (100+ MB).

input.xml

<xmllist>
        <page id="1">
        <title>movie title 1</title>
        <text>this is a movie in theatres</text>
        </page>
        <page id="2">
        <title>movie title 2</title>
        <text>this is a horror film</text>
        </page>
        <page id="3">
        <title></title>
        <text>actor in film</text>
        </page>
        <page id="4">
        <title>some other topic</title>
        <text>nothing related</text>
        </page>
</xmllist>

Expected output (all pages whose text contains "movie" or "film"):

<xmllist>
        <page id="1">
        <title>movie title 1</title>
        <text>this is a movie in theatres</text>
        </page>
        <page id="2">
        <title>movie title 2</title>
        <text>this is a horror film</text>
        </page>
        <page id="3">
        <title></title>
        <text>actor in film</text>
        </page>
</xmllist>

Current code

import xml.etree.cElementTree as etree
from xml.etree.cElementTree import dump

output_file=open('/tmp/outfile.xml','w')

for event, elem in iter(etree.iterparse("/tmp/test.xml", events=('start','end'))):
    if event == "end" and elem.tag == "page": #need to add condition to search for strings
        output_file.write(elem)
        elem.clear()

How do I add the regular expression to filter based on page's text attribute?

Rajesh Chamarthi

1 Answer


You're looking for a child element, not an attribute, so it's simplest to check the title as it "passes by" in the iteration and remember the result until you reach the end of the enclosing page:

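import re
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

good_page = False
with open('/tmp/outfile.xml', 'w') as output_file:
    for event, elem in etree.iterparse("/tmp/test.xml", events=('start', 'end')):
        if event == 'end':
            if elem.tag == 'title':
                # elem.text is None for an empty <title></title>, so fall back to ''
                good_page = re.search(r'film|movie', elem.text or '')
            elif elem.tag == 'page':
                if good_page:
                    # write() needs a string, so serialize the element first (Python 3)
                    output_file.write(etree.tostring(elem, encoding='unicode'))
                good_page = False
                elem.clear()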

re.search returns None when nothing matches, and the if treats None as false, so pages without a title are skipped along with pages whose title text does not match your regular expression.
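
The code above matches on the title, while the question's expected output keeps page 3, whose title is empty but whose text contains "film". A variant that tests the text child instead may be closer to what was asked. The following is only a sketch under a few assumptions: Python 3, a hypothetical helper named filter_pages, and a root element name (xmllist) hardcoded from the sample input.

```python
import re
import xml.etree.ElementTree as ET


def filter_pages(src, dst, pattern=r"movie|film"):
    """Copy <page> elements whose <text> child matches pattern from src to dst.

    filter_pages is a hypothetical helper; the <xmllist> wrapper is an
    assumption taken from the sample input. Returns the number of pages kept.
    """
    matcher = re.compile(pattern)
    kept = 0
    with open(dst, "w", encoding="utf-8") as out:
        out.write("<xmllist>\n")
        # Only 'end' events are needed: by then a page's children are parsed.
        for event, elem in ET.iterparse(src, events=("end",)):
            if elem.tag != "page":
                continue
            # findtext may yield None or '' for a missing or empty <text>.
            body = elem.findtext("text") or ""
            if matcher.search(body):
                out.write(ET.tostring(elem, encoding="unicode"))
                kept += 1
            # Drop the page's children so memory use stays low on large files.
            elem.clear()
        out.write("</xmllist>\n")
    return kept
```

Calling filter_pages("/tmp/test.xml", "/tmp/outfile.xml") on the sample input should keep pages 1, 2 and 3. Whitespace between the pages in the output may differ from the input, because an element's trailing whitespace is not guaranteed to be available when its end event fires.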

Alex Martelli