
I have parsed an XML file with xmltodict, and I have discovered the path to the <coordinates> tag, from which I want to extract lat & long values to add to a dataframe. Here is a small sample:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
   <Document>
      <Folder>
         <name>One Line Diagram</name>
         <open>0</open>
         <Folder>
            <name>SectionOne</name>
            <open>0</open>
            <Folder>
               <name>Node</name>
               <open>0</open>
               <Placemark>
                  <name>5680420</name>
                  <styleUrl>#Style_0</styleUrl>
                  <description />
                  <MultiGeometry type="MultiGeometry" Type="MultiGeometry">
                     <Polygon>
                        <outerBoundaryIs>
                           <LinearRing>
                              <coordinates>-83.6514766,67.0234192 -83.6515403,67.0233918 -83.6515309,67.0233134 -83.6514609,67.0232885 -83.5778406,67.0246267 -83.5777768,67.0246541 -83.5777861,67.0247325 -83.5778560,67.0247574 -83.6514766,67.0234192</coordinates>
                           </LinearRing>
                        </outerBoundaryIs>
                     </Polygon>
                  </MultiGeometry>
               </Placemark>
               <Placemark>
                  <name>25934531</name>
                  <styleUrl>#Style_0</styleUrl>
                  ML60
                  <description />
                  <MultiGeometry type="MultiGeometry" Type="MultiGeometry">
                     <Polygon>
                        <outerBoundaryIs>
                           <LinearRing>
                              <coordinates>-83.6512679,67.0216805 -83.6513317,67.0216531 -83.6513222,67.0215747 -83.6512522,67.0215498 -83.5967049,67.0225434 -83.5966412,67.0225708 -83.5966505,67.0226492 -83.5967204,67.0226741 -83.6512679,67.0216805</coordinates>
                           </LinearRing>
                        </outerBoundaryIs>
                     </Polygon>
                  </MultiGeometry>
               </Placemark>
            </Folder>
         </Folder>
      </Folder>
   </Document>
</kml>

And the path is below.

> doc['kml']['Document']['Folder']['Folder']['Folder'][0]['Placemark'][0]['MultiGeometry']['Polygon']['outerBoundaryIs']['LinearRing']['coordinates']

This is an extremely long XML document with 4 Folder tags, but I only need the values from the first one, ['Folder'][0]. What I have no clue how to do is iterate through all the ['Placemark'][n] entries until all the coordinates are extracted.

I have tried several things. The last attempt is below; it was meant as a start on working my way down to the correct tag, but to no avail.

root_elements = doc['Document'] if type(doc['Document']) == OrderedDict else [doc['Document']]
for element in root_elements:
    print(element['Placemark'])

Traceback:

Traceback (most recent call last)
<ipython-input-69-db580dc8b6e2> in <module>()
----> 1 root_elements = doc['Document'] if type(doc['Document']) == OrderedDict else [doc['Document']]
      2 for element in root_elements:
      3     print(element['Placemark'])

KeyError: 'Document'

Any help is appreciated.

  • The error is telling you that there is no key 'Document' in `doc`. Doesn't the path you posted start with `doc['kml']['Document']` (not `doc['Document']`)? – Galen Dec 22 '17 at 03:04
  • Now that you say that, why Yes It Is. I feel stupid. Thanks. –  Dec 22 '17 at 14:06
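A mismatch like the one pointed out in the comment above (the parsed document starts at 'kml', not 'Document') can be easier to spot by walking the key path one step at a time and reporting the first key that fails. The following is only a minimal sketch: `first_missing_key` is a hypothetical helper name, and the small `doc` literal stands in for the real xmltodict output.

```python
def first_missing_key(data, path):
    """Walk `path` through nested dicts/lists; return the first key that fails, or None."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return key
    return None


# Stand-in for the parsed document: the real top-level key is 'kml', not 'Document'.
doc = {"kml": {"Document": {"Folder": {"name": "One Line Diagram"}}}}

print(first_missing_key(doc, ["Document"]))                   # Document
print(first_missing_key(doc, ["kml", "Document", "Folder"]))  # None
```

Calling `list(doc.keys())` at each level gives the same information interactively.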

1 Answer


Your XML is missing closing tags for 2 folders (the 4th-to-last and 3rd-to-last lines below). Just copy & paste them into your XML file.

The XML below was indented with this tool: https://www.freeformatter.com/xml-formatter.html#ad-output

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
   <Document>
      <Folder>
         <name>One Line Diagram</name>
         <open>0</open>
         <Folder>
            <name>SectionOne</name>
            <open>0</open>
            <Folder>
               <name>Node</name>
               <open>0</open>
               <Placemark>
                  <name>5680420</name>
                  <styleUrl>#Style_0</styleUrl>
                  <description />
                  <MultiGeometry type="MultiGeometry" Type="MultiGeometry">
                     <Polygon>
                        <outerBoundaryIs>
                           <LinearRing>
                              <coordinates>-83.6514766,67.0234192 -83.6515403,67.0233918 -83.6515309,67.0233134 -83.6514609,67.0232885 -83.5778406,67.0246267 -83.5777768,67.0246541 -83.5777861,67.0247325 -83.5778560,67.0247574 -83.6514766,67.0234192</coordinates>
                           </LinearRing>
                        </outerBoundaryIs>
                     </Polygon>
                  </MultiGeometry>
               </Placemark>
               <Placemark>
                  <name>25934531</name>
                  <styleUrl>#Style_0</styleUrl>
                  ML60
                  <description />
                  <MultiGeometry type="MultiGeometry" Type="MultiGeometry">
                     <Polygon>
                        <outerBoundaryIs>
                           <LinearRing>
                              <coordinates>-83.6512679,67.0216805 -83.6513317,67.0216531 -83.6513222,67.0215747 -83.6512522,67.0215498 -83.5967049,67.0225434 -83.5966412,67.0225708 -83.5966505,67.0226492 -83.5967204,67.0226741 -83.6512679,67.0216805</coordinates>
                           </LinearRing>
                        </outerBoundaryIs>
                     </Polygon>
                  </MultiGeometry>
               </Placemark>
            </Folder>
         </Folder>
      </Folder>
   </Document>
</kml>

Using xmltodict to extract the coordinates from a coordinates.xml file containing your XML (with the 2 missing folder closing tags included):

import xmltodict

# Parse the KML file into nested dicts/lists
with open('coordinates.xml') as coords:
    doc = xmltodict.parse(coords.read())

coordinates = []

# Loop over each Placemark tag in the innermost Folder
for placemark in doc['kml']['Document']['Folder']['Folder']['Folder']['Placemark']:
    # Get the coordinates string from the current placemark
    coordinate_string = placemark['MultiGeometry']['Polygon']['outerBoundaryIs']['LinearRing']['coordinates']

    # Split the string into coordinate pairs on spaces (" "),
    # then split the x & y of each pair on the comma (",")
    coordinate_list = [pair.split(",") for pair in coordinate_string.split(" ")]
    coordinates.append(coordinate_list)

print(coordinates)

Output of printing "coordinates" list

[[[u'-83.6514766', u'67.0234192'], [u'-83.6515403', u'67.0233918'], [u'-83.6515309', u'67.0233134'], [u'-83.6514609', u'67.0232885'], [u'-83.5778406', u'67.0246267'], [u'-83.5777768', u'67.0246541'], [u'-83.5777861', u'67.0247325'], [u'-83.5778560', u'67.0247574'], [u'-83.6514766', u'67.0234192']], [[u'-83.6512679', u'67.0216805'], [u'-83.6513317', u'67.0216531'], [u'-83.6513222', u'67.0215747'], [u'-83.6512522', u'67.0215498'], [u'-83.5967049', u'67.0225434'], [u'-83.5966412', u'67.0225708'], [u'-83.5966505', u'67.0226492'], [u'-83.5967204', u'67.0226741'], [u'-83.6512679', u'67.0216805']]]

coordinates[0] gives list of coordinates from 1st placemark tag

[[u'-83.6514766', u'67.0234192'], [u'-83.6515403', u'67.0233918'], [u'-83.6515309', u'67.0233134'], [u'-83.6514609', u'67.0232885'], [u'-83.5778406', u'67.0246267'], [u'-83.5777768', u'67.0246541'], [u'-83.5777861', u'67.0247325'], [u'-83.5778560', u'67.0247574'], [u'-83.6514766', u'67.0234192']]

coordinates[0][0] gives first coordinate pair from 1st placemark tag

[u'-83.6514766', u'67.0234192']

coordinates[0][0][0] gives x value of first coordinate pair from 1st placemark tag

-83.6514766
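The values in the lists above are still strings, so they need converting before they go into a dataframe. KML writes each tuple as longitude,latitude with an optional altitude, so the first number is the longitude. Here is a rough sketch of that conversion; `parse_coordinates` and `placemark_rows` are hypothetical names, and the sample string is just the first two pairs from the first placemark.

```python
def parse_coordinates(text):
    """Turn 'lon,lat lon,lat ...' into a list of (lon, lat) float tuples."""
    pairs = []
    for chunk in text.split():  # split() tolerates newlines and repeated spaces
        parts = chunk.split(",")
        # KML order is longitude first; any third (altitude) field is ignored
        pairs.append((float(parts[0]), float(parts[1])))
    return pairs


def placemark_rows(name, text):
    """Build one flat record per coordinate pair, tagged with the Placemark name."""
    return [{"placemark": name, "lon": lon, "lat": lat}
            for lon, lat in parse_coordinates(text)]


sample = "-83.6514766,67.0234192 -83.6515403,67.0233918"
rows = placemark_rows("5680420", sample)
print(rows[0])
```

A list of records like `rows` is a shape that `pandas.DataFrame(rows)` accepts, assuming pandas is installed.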
Peter Out
  • Sorry about the formatting. What you see is a very small portion of the entire code. I just put in some so the board could get a sense of the data. –  Dec 22 '17 at 13:50
  • Peter Out, what you did is like what I was doing, but I need to extract tuples from multiple `['Placemark']` tags when the number can change from file to file. This morning I thought maybe the simple solution is `placemark = doc['kml']['Document']['Folder']['Folder']['Folder']['Placemark']` then do `len(placemark)` I'll try this. –  Dec 22 '17 at 14:01
  • I will add I like how you brought the path to where I need into a variable. Like lots of people on the board, I'm teaching myself Python and sometimes the obvious things slip past. –  Dec 22 '17 at 14:05
  • Sure thing, i'm by no means an expert at Python myself. – Peter Out Dec 23 '17 at 01:03
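On the point raised in the comments about the number of Placemark tags changing from file to file: xmltodict returns a dict when a tag occurs once under its parent and a list when it repeats, so a loop that works on one file can break on another. One way to cope is to normalise to a list before iterating. This is a sketch under assumptions: `as_list`, `iter_coordinates` and `make_placemark` are hypothetical names, and the dict literals only imitate the shape of the parsed structure.

```python
def as_list(value):
    """Return `value` unchanged if it is a list, wrap a single item, map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def iter_coordinates(folder):
    """Yield (name, coordinates string) for every Placemark directly inside `folder`."""
    for placemark in as_list(folder.get("Placemark")):
        linear_ring = placemark["MultiGeometry"]["Polygon"]["outerBoundaryIs"]["LinearRing"]
        yield placemark["name"], linear_ring["coordinates"]


def make_placemark(name, coords):
    """Build a dict shaped like the parsed output for one Placemark (sample data only)."""
    return {"name": name,
            "MultiGeometry": {"Polygon": {"outerBoundaryIs": {
                "LinearRing": {"coordinates": coords}}}}}


# A single Placemark parses to a dict, repeated Placemarks parse to a list.
one = {"Placemark": make_placemark("5680420", "-83.6514766,67.0234192")}
two = {"Placemark": [make_placemark("5680420", "-83.6514766,67.0234192"),
                     make_placemark("25934531", "-83.6512679,67.0216805")]}

print(len(list(iter_coordinates(one))))  # 1
print(len(list(iter_coordinates(two))))  # 2
```

The same wrapper can be applied at the ['Folder'] levels, for example `as_list(parent['Folder'])[0]` to take the first Folder whether or not there are several. Alternatively, `xmltodict.parse` accepts a `force_list` argument (for instance `force_list=('Placemark',)`) that makes the named tags always parse as lists.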