I have an OSM file that captures a small neighborhood. http://pastebin.com/xeWJsPeY

I have Python code that does a lot of extra parsing, but an example of the main problem can be seen here:

import xml.etree.cElementTree as CET
osmfile = open('osm_example.osm','r')
for event, elem in CET.iterparse(osmfile,events = ('start',)):
    if elem.tag == 'way':
        if elem.get('id') == "21850789":
            for child in elem:
                print CET.tostring(child,encoding='utf-8')
    elem.clear()

Here, and elsewhere, I noticed that the tags for a specific entry are missing (where a tag is an element that looks like <tag k="highway" v="residential" />). All of the <nd .../> elements were read correctly, as far as I can see.

One other thing I noticed when processing the files: when I use tostring() on an element with a 'way' tag and its <tag .../> children were not read, the output has no newline appended to the end of it. For example, when running

for event, elem in CET.iterparse(osmfile,events = ('start',)):
    if elem.tag == 'way':
        print CET.tostring(elem,encoding='utf-8')
    elem.clear()

The output for an entry with missing <tag .../> elements is

<nd ref="235476200" />
  <nd ref="1865868598" /></way><way changeset="12727901" id="21853023" timestamp="2012-08-14T15:23:13Z" uid="451048" user="bbmiller" version="8" visible="true">
  <nd ref="1865868557" />

versus one that is formed just fine,

 <tag k="tiger:zip_left" v="60061" />
  <tag k="tiger:zip_right" v="60061" />
 </way>
 <way changeset="15851022" id="21874389" timestamp="2013-04-24T16:33:28Z" uid="451693" user="bot-mode" version="3" visible="true">
  <nd ref="235666887" />
  <nd ref="235666891" />

What is going on here?

Max Candocia

1 Answer

You seem to be searching for child elements in response to the start event. But the child elements have not necessarily been read yet.

Consider this fragment:

<a>foo<b/></a>

The start event occurs after the parser has read <a>, but before it reads foo and, more to the point, before it reads <b/>. As the documentation says:

Note iterparse() only guarantees that it has seen the “>” character of a starting tag when it emits a “start” event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for “end” events instead.
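To make the "may or may not be present" part concrete, here is a minimal sketch using the fragment above. It assumes Python 3 and xml.etree.ElementTree (cElementTree is gone in recent versions), and it uses XMLPullParser instead of iterparse() so that the point at which the parser receives data is under our control. The helper name child_counts and the way the input is split are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

def child_counts(chunks):
    """Feed the chunks one at a time and record (event, tag, number of
    children) as observed right after each chunk has been parsed."""
    parser = ET.XMLPullParser(events=("start", "end"))
    seen = []
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            seen.append((event, elem.tag, len(elem)))
    parser.close()
    return seen

# The parser has only seen "<a>" when the start event for <a> is read.
split = child_counts(["<a>", "foo<b/></a>"])
# The whole document is parsed before any event is read.
whole = child_counts(["<a>foo<b/></a>"])

print(split[0])   # start event for <a>, read before <b/> was parsed
print(whole[0])   # start event for <a>, read after <b/> was parsed
print(split[-1])  # end event for <a>
```

In the split run the start event for <a> reports no children, in the single-chunk run it reports one, and the end event reports one in both. As far as I know, iterparse() in CPython reads the file in fixed-size blocks and hands out the events for a block only after parsing it, which would explain why only some of your ways lose their <tag> children: most likely those are the ones that straddle a block boundary.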

So, you might get the behavior you want by listening for 'end' events instead, changing the loop header to:

for event, elem in CET.iterparse(osmfile,events = ('end',)):
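A fuller sketch of that loop, under some assumptions: Python 3 with xml.etree.ElementTree, a tiny inline document (SAMPLE_OSM, made up here to stand in for osm_example.osm), and a hypothetical helper way_tags. Note that with 'end' events you should not call clear() on every element: <nd> and <tag> finish before their parent <way> does, and clear() also wipes attributes, so clearing them early would leave the way with empty children.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature OSM document standing in for osm_example.osm.
SAMPLE_OSM = b"""<osm version="0.6">
 <way id="21850789">
  <nd ref="235476200" />
  <nd ref="1865868598" />
  <tag k="highway" v="residential" />
 </way>
</osm>"""

def way_tags(source, way_id):
    """Return the {k: v} tags of the way with the given id, or None."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "way":
            # Do not clear <nd>/<tag> here: their parent <way> still
            # needs their attributes when its own end event arrives.
            continue
        found = elem.get("id") == way_id
        # At the end event all children of this <way> have been read.
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        elem.clear()  # free the way and its children once processed
        if found:
            return tags
    return None

print(way_tags(io.BytesIO(SAMPLE_OSM), "21850789"))
```

For the sample above this prints {'highway': 'residential'}. Clearing only at the end of each <way> keeps memory use bounded by the size of a single way rather than the whole file, although the emptied elements still stay attached to the root.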
Robᵩ

I used 'end' and changed the way I cleared elements. I avoided 'end' before because I wanted to clear the children last, but I was able to make a workaround for that. Thanks. – Max Candocia Apr 20 '15 at 16:38