How to apply xmlTree iterparse to nested XML set

Question

I am trying to replicate the example from this tutorial, but using iterparse with elem.clear().

XML example:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
    <world>
        <region name="USA">
            <AgSupplySector name="Corn" nocreate="1">
                <AgSupplySubsector name="Corn_NelsonR" nocreate="1">
                    <AgProductionTechnology name="Corn_NelsonR" nocreate="1">
                        <period year="1975">
                            <Non-CO2 name="SO2_1_AWB">
                                <input-emissions>3.98749e-05</input-emissions>
                                <output-driver/>
                                <gdp-control name="GDP_control">
                                    <max-reduction>60</max-reduction>
                                    <steepness>3.5</steepness>
                                </gdp-control>
                            </Non-CO2>
                            <Non-CO2 name="NOx_AWB">
                                <input-emissions>0.000285263</input-emissions>
                                <output-driver/>
                                <gdp-control name="GDP_control">
                                    <max-reduction>60</max-reduction>
                                    <steepness>3.5</steepness>
                                </gdp-control>
                            </Non-CO2>
                        </period>
                    </AgProductionTechnology>
                </AgSupplySubsector>
            </AgSupplySector>
        </region>
    </world>
</scenario>

The output is expected like this: I am trying to parse it using the following code:

import os
import xml.etree.cElementTree as etree
import codecs
import csv

PATH = 'D:\Book1'
FILENAME_BIO = 'Test.csv'
FILENAME_XML = 'all_aglu_emissions.xml'
ENCODING = "utf-8"


pathBIO = os.path.join(PATH, FILENAME_BIO)
pathXML = os.path.join(PATH, FILENAME_XML)

with codecs.open(pathBIO, "w", ENCODING) as bioFH:
    bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
    bioWriter.writerow(['Year','Gas', 'Value','Technology','Crop','Country'])

    for event, elem in etree.iterparse(pathXML, events=('start','end')):
        if event == 'start' and elem.tag == 'region':
            str1 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgSupplySector':
            str2 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgProductionTechnology':
            str3 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'period':
            str4 = elem.attrib['year']
        elif event == 'start' and elem.tag == 'Non-CO2':
            str5 = elem.attrib['name']
        elif event == 'end' and elem.tag == 'input-emissions':
            for em in elem.iter('input-emissions'):
                str6 = em.text
                bioWriter.writerow([str4, str5, str6, str3, str2, str1])
            
            elem.clear()

My issue(s) here is that I got more extra lines with empty fields for str6. Probably, I have nesting problem here. Please help. Error example (0 fields appear):

Expected output is added. By replicate I mean replicate the output, but with iterparse. — Bex, Jun 07 '21 at 16:59

score 2 · Accepted Answer · answered Jun 07 '21 at 17:51

2

The for em in elem.iter('input-emissions') loop is useless, drop it.

import os
import xml.etree.ElementTree as etree
import csv

PATH = '.'
FILENAME_BIO = 'Test.csv'
FILENAME_XML = 'all_aglu_emissions.xml'


pathBIO = os.path.join(PATH, FILENAME_BIO)
pathXML = os.path.join(PATH, FILENAME_XML)

with open(pathBIO, 'w', encoding='utf8', newline='') as bioFH:
    bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
    bioWriter.writerow('Year Gas Value Technology Crop Country'.split())

    for event, elem in etree.iterparse(pathXML, events=('start',)):
        if elem.tag == 'region':
            str1 = elem.attrib['name']
        elif elem.tag == 'AgSupplySector':
            str2 = elem.attrib['name']
        elif elem.tag == 'AgProductionTechnology':
            str3 = elem.attrib['name']
        elif elem.tag == 'period':
            str4 = elem.attrib['year']
        elif elem.tag == 'Non-CO2':
            str5 = elem.attrib['name']
        elif elem.tag == 'input-emissions':
            str6 = elem.text
            bioWriter.writerow([str4, str5, str6, str3, str2, str1])
        elem.clear()

There are some other subtle changes I made to the code, since I assume you're using Python 3 for this. They include using xml.etree.ElementTree instead of the obsolete xml.etree.cElementTree, skipping the codecs module (Python 3 can do that natively) and passing the newline='' parameter to the open() call, so the csv module can handle newlines correctly by itself.

Since listening to the start event is enough for the desired effect, I've dropped handling the end event entirely.

The result is

Year,Gas,Value,Technology,Crop,Country
1975,SO2_1_AWB,3.98749e-05,Corn_NelsonR,Corn,USA
1975,NOx_AWB,0.000285263,Corn_NelsonR,Corn,USA

answered Jun 07 '21 at 17:51

Tomalak

332,285
67
532
628

Thank you for your good comments and improvements! When I use [full xml set](https://github.com/kanishkan91/ConvertXMLtoDataframepy/blob/master/all_aglu_emissions.xml), I got several thousands with None like ths: Year Gas Value Technology Crop Country 2005 NOx_AGR Corn_ArkWhtRedR Corn USA But it should be : Year Gas Value Technology Crop Country 2005 NOx_AGR 0.0033201014 Corn_ArkWhtRedR Corn USA What could be wrong? – Bex Jun 07 '21 at 23:40
@Bex No idea...? Surely you've been working on finding out in the meantime, what have you discovered? – Tomalak Jun 08 '21 at 07:11
I tried to apply the same logic to another data XML set and got less lines than expected (by 90 lines). Again with a different set, when I changed events from ('start',) to ('end',) it worked. Here, I tried to use end events only, but getting NameError message: Name str# is not defined. – Bex Jun 09 '21 at 11:00
@Bex Of course such a solution assumes that your input XML has a completely regular structure. If you have any structural variations, i.e. occasionally missing elements, occasionally missing attributes, occasionally different element order, then you would have to rewrite the solution to take those variations into account. The above works for your sample XML in the question. If your real XML looks differently, your question should have shown some of that. – Tomalak Jun 09 '21 at 15:34
@Bex what if the elem.tag 'input-emissions' is occassionally missing? I'm trying to apply your wonderful method on a similar problem of mine, but instead of having the csv populated with blank/None when the element is missing, it takes the last available value where the element existed. I would appreciate any idea/help, or if you think it's more appropriate, I can post a question. – choabf Jul 17 '23 at 13:10
1

@choabf In that case you explicitly need to set your strings (the ones I have as `str1` etc) to empty after the `writerow()` call. This way they won't retain their previous values and cannot pollute the next row in the output CSV. – Tomalak Jul 18 '23 at 16:42
@Tomalak thank you so much! It worked : ) (And sorry for the delay in checking this out, got carried away with other items). Wish you a great rest of the summer (?) ahead! – choabf Aug 09 '23 at 12:25

How to apply xmlTree iterparse to nested XML set

1 Answers1