By default, pandas.read_xml
parses all the immediate descendants of a set of nodes including its child nodes and attributes. Unless, the xpath
argument indicates it, read_xml
will not go further than immediate descendants.
To illustrate your use case. Below is likely the generalized set up of your XML where <Type>
and its siblings, <Activation>
and <Amt>
are parsed. However, <Amt>
does not contain a text node, only attributes. So the value in that column should be empty.
<root>
<row>
<Type>PYI</Type> <!-- Type IS A CHILD NODE OF row -->
<Activation>N</Activation> <!-- Activation IS A CHILD NODE OF row -->
<Amt val="4000" curr="GBP"/> <!-- Amt IS A CHILD NODE OF row -->
</row>
</root>
But then you ask, why did read_xml
ignore the val and curr attributes? Because each are not an immediate descendant of <row>
. They are descendants of <Amt>
(i.e., grandchildren of <row>
). If attributes were moved to <row>
, then they will be captured as shown below:
<root>
<row val="4000" curr="GBP"> <!-- val AND curr ARE CHILD ATTRIBS OF row -->
<Type>PYI</Type> <!-- Type IS A CHILD NODE OF row -->
<Activation>N</Activation> <!-- Activation IS A CHILD NODE OF row -->
<Amt/> <!-- Amt IS A CHILD NODE OF row -->
</row>
</root>
To capture those attributes, adjust xpath
argument to point to its immediate parent:
amt_df = pd.read_xml("Input.xml", xpath="//Amt")
To have such attributes captured with <row>
level information, consider the special-purpose language, XSLT, to transform your original XML to the following:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<row>
<Type>PYI</Type>
<Activation>N</Activation>
<Amt_val>4000</Amt_val>
<Amt_curr>GBP</Amt_curr>
</row>
</root>
Above is the intermediate output that is parsed by read_xml
when using the stylesheet
argument as shown below:
xsl = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="row">
<xsl:copy>
<xsl:copy-of select="*[name() != 'Amt']"/>
<Amt_val><xsl:value-of select="Amt/@val"/></Amt_val>
<Amt_curr><xsl:value-of select="Amt/@curr"/></Amt_curr>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>'''
row_df = pd.read_xml("Input.xml", xpath="//row", stylesheet=xsl")