Questions tagged [iterparse]

iterparse is a function provided by XML parsing libraries that parses a document incrementally and reports events while the tree is being built

Use this tag for questions about XML parsing code that uses iterparse. iterparse normally builds the full tree as it parses the XML, but you can safely rearrange or remove parts of the tree while parsing.
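
As a rough illustration of the description above, the following minimal sketch uses the standard library's xml.etree.ElementTree.iterparse to process elements as they are completed and to discard each one afterwards. The sample document, the tag name "item", and the helper name iter_item_texts are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; real input would usually be a large file.
SAMPLE = b"<root><item>a</item><item>b</item><item>c</item></root>"

def iter_item_texts(source, tag="item"):
    """Yield the text of each <tag> element, discarding it once processed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.text
            elem.clear()  # drop text, attributes and children of the finished element

texts = list(iter_item_texts(io.BytesIO(SAMPLE)))
```

Only "end" events are requested here because an element is guaranteed to be complete at that point.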

83 questions
4
votes
1 answer

Use iterparse and, subsequently, xpath on documents with inconsistent namespace declarations

I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following: from lxml import etree for e, tag in etree.iterparse(source, tag='Foo'): print tag.xpath('bar/baz')[42] #…
Lev Levitsky
  • 63,701
  • 20
  • 147
  • 175
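
One possible way to cope with inconsistent namespace declarations is to compare local names only. The sketch below swaps lxml and xpath for the standard library's xml.etree.ElementTree and plain iteration; the sample document and the helper names local_name and iter_bar_texts are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document in which <Foo> appears under two different namespaces
# and once without any namespace.
SAMPLE = b"""<root xmlns:a="urn:a" xmlns:b="urn:b">
  <a:Foo><a:bar>1</a:bar></a:Foo>
  <b:Foo><b:bar>2</b:bar></b:Foo>
  <Foo><bar>3</bar></Foo>
</root>"""

def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rpartition("}")[2]

def iter_bar_texts(source):
    """Yield the text of every <bar> child of every <Foo>, ignoring namespaces."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "Foo":
            # Compare children by local name too, since their namespace may vary.
            for child in elem:
                if local_name(child.tag) == "bar":
                    yield child.text
            elem.clear()

bar_texts = list(iter_bar_texts(io.BytesIO(SAMPLE)))
```

Ignoring namespaces is only safe when the local names are unambiguous in the document at hand.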
3
votes
5 answers

Ignore encoding errors in Python (iterparse)?

I've been fighting with this for an hour now. I'm parsing an XML-string with iterparse. However, the data is not encoded properly, and I am not the provider of it, so I can't fix the encoding. Here's the error I get: lxml.etree.XMLSyntaxError: line…
Martti Laine
  • 12,655
  • 22
  • 68
  • 102
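
When the XML is already held in memory as a byte string, one hedged workaround is to replace undecodable bytes before parsing. This sketch assumes the data is meant to be UTF-8 and uses the standard library's xml.etree.ElementTree rather than lxml; the sample bytes and the helper names sanitize_utf8 and item_texts are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: declared as UTF-8 but containing one invalid byte (0xFF).
BROKEN = b'<?xml version="1.0" encoding="utf-8"?><root><item>caf\xff</item></root>'

def sanitize_utf8(raw):
    """Replace undecodable bytes with U+FFFD and return clean UTF-8 bytes."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")

def item_texts(raw):
    """Parse the sanitized bytes and collect the text of each <item>."""
    cleaned = io.BytesIO(sanitize_utf8(raw))
    return [elem.text
            for _event, elem in ET.iterparse(cleaned, events=("end",))
            if elem.tag == "item"]

texts = item_texts(BROKEN)
```

The damaged characters are lost, so this only suits cases where an approximate reading of the text is acceptable.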
3
votes
2 answers

lxml iterparse fills memory despite on clear

I'm trying to parse XML. The first iterparse works correctly, but the second starts to fill memory. If I remove the first iterparse, nothing changes. The XML is valid. def clear_element(e): e.clear() while e.getprevious() is not None: del…
Shihal
  • 33
  • 3
3
votes
1 answer

xml.etree.ElementTree iterparse() still using lots of memory?

I've been experimenting with iterparse to reduce the memory footprint of my scripts that need to process large XML docs. Here's an example. I wrote this simple script to read a TMX file and split it into one or more output files not to exceed a…
dbl
  • 163
  • 1
  • 11
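
A common reason for memory growth with xml.etree.ElementTree.iterparse is that cleared elements remain attached to the root. The sketch below shows one way to avoid that, assuming the records are direct children of the root element; the sample data, the tag name "record", and the function name count_records are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: five <record> elements directly under the root.
SAMPLE = (
    b"<records>"
    + b"".join(b"<record><seg>%d</seg></record>" % i for i in range(5))
    + b"</records>"
)

def count_records(source, tag="record"):
    """Count <tag> elements and report how many children the root still holds."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            root.clear()  # detach every finished child (this also resets root attributes)
    return count, len(root)

count, remaining = count_records(io.BytesIO(SAMPLE))
```

Calling elem.clear() alone would empty each record but leave it in the root's list of children, which still adds up over a very large file.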
3
votes
1 answer

How long should ElementTree iterparse take?

In answering another question, someone showed me the following tutorial, in which the author claims to have used iterparse to parse a ~100 MB XML file in under 3…
russell
  • 350
  • 1
  • 13
3
votes
1 answer

Grabbing tag with lxml's iterparse

I'm running into a problem with using lxml's iterparse on my HTML. I'm trying to get the <title>'s text but this simple function doesn't work on complete web pages: def get_title(str): titleIter = etree.iterparse(StringIO(str), tag="title") …
python dom web-scraping lxml iterparse
asked Apr 24 '12 at 01:16
Ben G
  • 26,091
  • 34
  • 103
  • 170
3
votes
0 answers

validate in python with lxml's iterparse against multiple DTDs conditionally

I am parsing and validating fairly large XMLs (>100MB) against multiple DTDs, conditionally, based on the docinfo: parser = etree.XMLParser(recover=True) xmlfile = etree.parse(file,parser) if "aaa.dtd" in xmlfile.docinfo.doctype.lower(): …
python lxml dtd iterparse
asked Apr 11 '12 at 09:29
Panos
  • 85
  • 1
  • 4
2
votes
1 answer

Parsing Large XML file with Python lxml and Iterparse

I'm attempting to write a parser using lxml and the iterparse method to step through a very large xml file containing many items. My file is of the format: <item> <title>Item 1 Description 1

Dave Johnshon
  • 475
  • 1
  • 7
  • 6
2
votes
0 answers

Possible Memory Leak in Parsing a XML File?

I have a long-running script which parses a large XML file (~9 GB) and inserts data into a database in chunks. This is what it looks like: import lxml.etree as ET import gc def __get_elements1(self): context = ET.iterparse(tmp_folder +…
2
votes
2 answers

Parsing of a huge xml file with `pythons etree.iterparse()` does not work right. Is there a logic error in the code?

I want to parse a huge XML file. The records in this file look, for example, like this. And in general the file looks like this record_1 ... …
Aufwind
  • 25,310
  • 38
  • 109
  • 154
2
votes
2 answers

Parsing incrementally a large wikipedia dump XML file using python

The goal is to read all … stuff from a Wikipedia dump (a 70 GB file). It cannot be loaded into memory, so I tried to parse the file incrementally and get some values from it. However, the script I just wrote does not print anything and…
Captain Nemo
  • 345
  • 2
  • 14
2
votes
1 answer

element attributes missing when parsing XML with iterparse/lxml/python 2

Here's my use case: I have a potentially large XML file, and I want to output the frequency of all the unique structural variations of a given element type. Element attributes should be included as part of the uniqueness test. The output should sort…
George
  • 579
  • 1
  • 5
  • 12
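
A possible sketch of the use case described above, using the standard library's xml.etree.ElementTree instead of lxml. It assumes that "structural variation" means tag names, attribute names (not values), and child order; the sample document and the names signature and structure_counts are hypothetical.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample with two structural variants of <rec>.
SAMPLE = b"""<root>
  <rec id="1"><a/><b x="1"/></rec>
  <rec id="2"><a/><b x="2"/></rec>
  <rec><a/></rec>
</root>"""

def signature(elem):
    """Describe an element by tag, sorted attribute names, and child signatures."""
    return (elem.tag, tuple(sorted(elem.attrib)), tuple(signature(c) for c in elem))

def structure_counts(source, tag):
    """Count how often each structural variant of <tag> occurs."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Build the signature before clearing: clear() also removes attributes.
            counts[signature(elem)] += 1
            elem.clear()
    return counts

counts = structure_counts(io.BytesIO(SAMPLE), "rec")
```

counts.most_common() then lists the variants from most to least frequent.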
2
votes
1 answer

iterparse large XML using python

This has been driving me nuts all day and I would appreciate a bit of help with parsing a large XML file ... the file contains over 900,000 lines and is downloaded in gzip format. I did have something working using an extract of the data for testing…
Sandman112
  • 65
  • 1
  • 4
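
For gzip-compressed input, one option is to hand the decompressing stream straight to iterparse so the file never has to be unpacked to disk. This is a minimal sketch using the standard library's gzip and xml.etree.ElementTree modules; the payload, the tag name "entry", and the function name count_elements are assumptions for the example.

```python
import gzip
import io
import xml.etree.ElementTree as ET

def count_elements(gz_source, tag):
    """Count <tag> elements in a gzip-compressed XML stream."""
    # gzip.open accepts either a file name or an already open binary file object.
    with gzip.open(gz_source, "rb") as stream:
        count = 0
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == tag:
                count += 1
                elem.clear()  # free the finished element's contents
        return count

# Hypothetical in-memory stand-in for the downloaded .gz file.
payload = gzip.compress(b"<feed><entry>1</entry><entry>2</entry></feed>")
n_entries = count_elements(io.BytesIO(payload), "entry")
```

With a real download, the path of the .gz file can be passed in place of the BytesIO object.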
2
votes
2 answers

lxml.etree iterparse() and parsing element completely

I have an XML file with nodes that look like this: 41.3681107 3.9598 I am using lxml.etree.iterparse() to iteratively parse…
Andreas
  • 83
  • 2
  • 8
2
votes
1 answer

python element tree iterparse filter nodes and children

I am trying to use ElementTree's iterparse function to filter nodes based on the text and write them to a new file. I am using iterparse because the input file is large (100+ MB) input.xml movie title…
python iterparse celementtree
asked Jan 31 '15 at 15:02
Rajesh Chamarthi
  • 18,568
  • 4
  • 40
  • 67
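
Filtering nodes by their text and writing the matches to a new file could be sketched as follows with xml.etree.ElementTree. The element names "movies", "movie" and "title" are guesses based on the excerpt, and the sample data and the function name filter_movies are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input; the real file is described as 100+ MB.
SAMPLE = (
    b"<movies>"
    b"<movie><title>Alien</title></movie>"
    b"<movie><title>Brazil</title></movie>"
    b"</movies>"
)

def filter_movies(source, out, wanted):
    """Copy <movie> elements whose <title> text is in `wanted` to `out`.

    `out` is a binary file object; each match is serialized as soon as it is
    complete, so the whole input never has to be held in memory.
    """
    out.write(b"<movies>")
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "movie":
            if elem.findtext("title") in wanted:
                out.write(ET.tostring(elem))
            elem.clear()
    out.write(b"</movies>")

buffer = io.BytesIO()
filter_movies(io.BytesIO(SAMPLE), buffer, {"Brazil"})
result = buffer.getvalue()
```

For a real run, `out` would be a file opened with mode "wb" instead of a BytesIO buffer.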