Parsing nested children nodes using pandas.read_xml

Question

I would like to import an xml with nested structure into a pandas dataframe. I include a sample xml

<?xml version='1.0' encoding='utf-8'?>
<AuditFile xmlns="urn:OECD:StandardAuditFile-Taxation/2.00">
    <MasterFiles>
        <Customers>
            <Customer>
                <RegistrationNumber>FR16270524</RegistrationNumber>
                <Name>Test guy S.A</Name>
                <Address>
                    <StreetName>1, av des Champs Elysées</StreetName>
                    <City>France</City>
                    <PostalCode>75000</PostalCode>
                    <Country>FR</Country>
                </Address>
                <Contact>
                    <ContactPerson>
                        <FirstName>Boom</FirstName>
                        <LastName>Baker</LastName>
                    </ContactPerson>
                    <Telephone>+331523526</Telephone>
                    <Email>boom.baker@sample.com</Email>
                </Contact>
                <TaxRegistration>
                    <TaxRegistrationNumber>FR16270524</TaxRegistrationNumber>
                    <TaxNumber>FR16270524</TaxNumber>
                </TaxRegistration>
                <CustomerID>2800002252</CustomerID>
                <AccountID>400100</AccountID>
                <OpeningDebitBalance>32.76</OpeningDebitBalance>
                <ClosingDebitBalance>0.0</ClosingDebitBalance>
            </Customer>
        </Customers>        
    </MasterFiles>      
</AuditFile>

By 'flatten', I mean that I would like that every end node corresponds to a column in the dataframe. The first column names will then be RegistrationNumber, Name, StreetName, etc...

Pandas' documentation of method read_xml () mentions, in the documentation of parameter elems_only, that, "by default, all child elements and non-empty text nodes are returned."

This is not the case if the structure contains nested children. Unlike Excel, only the first level of nodes is imported, not the nested ones.

I read a similar question on Javascript, where it was necessary to "flatten" the xml before import in a dataframe. I also looked at previous question Flatten XML data as a pandas dataframe, but the solution provided using XSLT is heavy and beyond my skills.

2 questions:

Is there a pandas' functionality that addresses those nested xmls that I missed (like Excel does)?
if this is not the case, is there an easier way to flatten or should I define the flat format manually (e.g by parsing into a dict and redefining the key as suggested in the stackoverflow article mentioned above?

Please edit your question and add a (well formed) sample xml (not an image) as well as what your expected output from that sample xml is supposed to look like.. — Jack Fleeting, Aug 06 '23 at 22:49

Hermann12 · Accepted Answer · 2023-08-07T10:09:16.143

1

You can use * as wildcard in xpath and // to dive deeper:

import pandas as pd

xml="""your xml here ..."""

df = pd.read_xml(xml, xpath= './/*')
print(df)

edited Aug 07 '23 at 10:09

answered Aug 07 '23 at 08:54

Hermann12

1,709
2
5
14

So, simple I feel stupid... I read the doc half a dozen of times, and I never notices this little wildcard (the doc could be more talkative about such great option). – JCF Aug 07 '23 at 18:44

Parsing nested children nodes using pandas.read_xml

1 Answers1