Python + XML documents

Question

I'm fairly new to XML and Python. Below is a cut-down version of a large XML file that I'm trying to load into Python and eventually write to a SQL Server database.

<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <PublicationSource value="TEST123" />
    <PublicationDate value="2022-05-23" />
    <PublicationSeqNum value="1659" />
    <FileCreationDateTime value="2022-05-23T22:14:47" />
    <RecordCount value="287654" />
    <ContentDescription value="FullFile_20220523" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
  <CodeSystems>
    <CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
      <concept id="RC2" code="2" displayName="World1" />
      <concept id="RC1" code="1" displayName="World2" />
    </CodeSystem>
    <CodeSystem name="OrganisationRole" oid="5.4.7.8">
      <concept id="B1ng0" code="179" displayName="BoomBastic" />
      <concept id="R2D2a" code="180" displayName="Fantastic" />
    </CodeSystem>
  </CodeSystems>
</MyOrgRefData:OrgRefData>

I've tried lxml, pandas.read_xml and xml.etree, but I can't work out how to get what I want.

Ideally I'd like to pull Manifest into a dataframe ready to send to SQL (with DataFrame.to_sql()). I would do the same with CodeSystems, but separately. (There are other sections, but I cut them to keep this short.)

For example, when reading the file with pandas, I only get a single column containing the values. I would like either to have the tag (Version, PublicationType, PublicationSource etc.) in a column next to the value, or to have the tags as column headers with the values pivoted across one row.

import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the concatenation below forms a valid path
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    xpath='//Manifest/*',
    attrs_only=True,
)
df_bulk.head()

This is the output I get:

inx	value
0	2-0-0
1	Full
2	TEST123
3	2022-05-23
4	1659
5	2022-05-23T22:14:47
6	287654
7	FullFile_20220523

Ideally I would like:

inx	value
Version	2-0-0
PublicationType	Full
PublicationSource	TEST123
PublicationDate	2022-05-23
PublicationSeqNum	1659
FileCreationDateTime	2022-05-23T22:14:47
RecordCount	287654
ContentDescription	FullFile_20220523

The eagle-eyed among you will notice I've left out PrimaryRoleScope. Ideally I would like to treat it separately, in its own dataframe, but I am unsure how to exclude it when pulling in the rest of the Manifest section.

Many thanks if you've read this far, even more thanks for any help.

Please see [here](https://stackoverflow.com/a/69859621/1422451) explaining that `read_xml` does not parse beyond its immediate descendants. — Parfait, Jun 15 '22 at 00:13
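
As the comment notes, `read_xml` does not parse beyond immediate descendants, so a nested section such as CodeSystems needs the parent's attributes carried down to each child row by hand. A minimal sketch of that idea using the standard library's xml.etree.ElementTree, on a trimmed copy of the sample from the question; the function name `concept_rows` and the `system_name` / `system_oid` keys are hypothetical names chosen for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the CodeSystems section from the question.
XML = """<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <CodeSystems>
    <CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
      <concept id="RC2" code="2" displayName="World1" />
      <concept id="RC1" code="1" displayName="World2" />
    </CodeSystem>
    <CodeSystem name="OrganisationRole" oid="5.4.7.8">
      <concept id="B1ng0" code="179" displayName="BoomBastic" />
    </CodeSystem>
  </CodeSystems>
</MyOrgRefData:OrgRefData>"""


def concept_rows(root):
    """Return one dict per <concept>, tagged with its parent CodeSystem."""
    rows = []
    # Only the root element is namespaced; its children carry no prefix,
    # so plain tag names work in the path below.
    for system in root.iterfind("CodeSystems/CodeSystem"):
        for concept in system.iterfind("concept"):
            rows.append({
                "system_name": system.get("name"),  # hypothetical column name
                "system_oid": system.get("oid"),    # hypothetical column name
                **concept.attrib,                   # id, code, displayName
            })
    return rows


rows = concept_rows(ET.fromstring(XML))
for row in rows:
    print(row)
```

A list of dicts like this can be passed to `pandas.DataFrame(rows)` to get one row per concept.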

zx485 · Accepted Answer · 2022-06-14T23:58:32.253

One possibility is using the stylesheet parameter to transform the XML data internally with XSLT before processing it.

So your code could look like this:

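import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the concatenation below forms a valid path
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',  # applied to the XML before xpath is evaluated
    xpath='/Root/Item',
    attrs_only=True,
)
print(df_bulk.head(10))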

The stylesheet (transform.xslt) passed to read_xml could look like this (the stylesheet parameter requires lxml):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>    
       
    <xsl:template match="/">
        <Root><xsl:apply-templates /></Root>
    </xsl:template>

    <xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
        <Item name="{name()}" value="{@value}" />
    </xsl:template>
    
</xsl:stylesheet>

This stylesheet produces intermediate XML like the following. pandas never displays it, but the xpath parameter above has to match its structure (here /Root/Item).

<Root>
    <Item name="Version" value="2-0-0"/>
    <Item name="PublicationType" value="Full"/>
    <Item name="PublicationSource" value="TEST123"/>
    <Item name="PublicationDate" value="2022-05-23"/>
    <Item name="PublicationSeqNum" value="1659"/>
    <Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
    <Item name="RecordCount" value="287654"/>
    <Item name="ContentDescription" value="FullFile_20220523"/>
</Root>

And the final output is

                   name                value
0               Version                2-0-0
1       PublicationType                 Full
2     PublicationSource              TEST123
3       PublicationDate           2022-05-23
4     PublicationSeqNum                 1659
5  FileCreationDateTime  2022-05-23T22:14:47
6           RecordCount               287654
7    ContentDescription    FullFile_20220523

The above approach uses only attributes, but you could also have the XSLT create an element structure if you prefer. In that case, change the second template to

<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
    <Item>
        <name><xsl:value-of select="name()" /></name>
        <value><xsl:value-of select="@value" /></value>
    </Item>
</xsl:template>

and your Python code to

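import pandas as pd

dataFolder = '/Some/directory/'  # trailing slash so the concatenation below forms a valid path
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',  # attrs_only is dropped because the values are now child elements
)
print(df_bulk.head(10))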

The output is the same.
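
If XSLT is not an option, a rough alternative (a different technique from the stylesheet approach above) is to walk the Manifest with the standard library's xml.etree.ElementTree. The sketch below assumes a trimmed copy of the sample from the question; the function name `manifest_rows` is made up for illustration. It collects the simple name/value entries and the PrimaryRole entries as two separate lists:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Manifest section from the question.
XML = """<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <Manifest>
    <Version value="2-0-0" />
    <RecordCount value="287654" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
      <PrimaryRole id="456" displayName="Free air for you" />
    </PrimaryRoleScope>
  </Manifest>
</MyOrgRefData:OrgRefData>"""


def manifest_rows(root):
    """Return (simple_rows, role_rows) built from the Manifest element."""
    # Manifest carries no namespace prefix, so a plain tag name finds it.
    manifest = root.find("Manifest")
    # One name/value pair per direct child, skipping the nested scope.
    simple_rows = [
        {"name": child.tag, "value": child.get("value")}
        for child in manifest
        if child.tag != "PrimaryRoleScope"
    ]
    # The nested PrimaryRole elements become their own list of records.
    role_rows = [dict(role.attrib) for role in manifest.iter("PrimaryRole")]
    return simple_rows, role_rows


simple_rows, role_rows = manifest_rows(ET.fromstring(XML))
print(simple_rows)
print(role_rows)
```

Each list can be handed to `pandas.DataFrame(...)` to get the two separate dataframes asked for in the question.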
