pandas.read_xml() unexpected behaviour

Question

I am trying to understand why the code:

import pandas

xml = '''
<ROOT>
  <ELEM atr="anything">1</ELEM>
  <ELEM atr="anything">2</ELEM>
  <ELEM atr="anything">3</ELEM>
  <ELEM atr="anything">4</ELEM>
  <ELEM atr="anything">5</ELEM>
  <ELEM atr="anything">6</ELEM>
  <ELEM atr="anything">7</ELEM>
  <ELEM atr="anything">8</ELEM>
  <ELEM atr="anything">9</ELEM>
  <ELEM atr="anything">10</ELEM>
</ROOT>
'''
df = pandas.read_xml(xml, xpath='/ROOT/ELEM')
print(df.to_string())

... works as expected and prints:

        atr  ELEM
0  anything     1
1  anything     2
2  anything     3
3  anything     4
4  anything     5
5  anything     6
6  anything     7
7  anything     8
8  anything     9
9  anything    10

Yet the following code:

import pandas

xml = '''
<ROOT>
  <ELEM>1</ELEM>
  <ELEM>2</ELEM>
  <ELEM>3</ELEM>
  <ELEM>4</ELEM>
  <ELEM>5</ELEM>
  <ELEM>6</ELEM>
  <ELEM>7</ELEM>
  <ELEM>8</ELEM>
  <ELEM>9</ELEM>
  <ELEM>10</ELEM>
</ROOT>
'''
df = pandas.read_xml(xml, xpath='/ROOT/ELEM')
print(df.to_string())

results in the error:

ValueError: xpath does not return any nodes or attributes. Be sure to
specify in `xpath` the parent nodes of children and attributes to
parse. If document uses namespaces denoted with xmlns, be sure to
define namespaces and use them in xpath.

I have read the documentation here: https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html

And also checked my xpath here (code above is just a minimal example, actual XML I use is more complex): https://freeonlineformatter.com/xpath-validator/

In a nutshell, I need to read a list of XML child elements at a known xpath into a pandas dataframe. The child elements have no attributes, but all have text values. I want to get a dataframe with one column containing these values. What am I doing wrong?

pandas has lots of cool shortcuts that work in very specific situations. When the shortcuts don't fit your data, then you need to do it by hand. What you have there is very easy to parse with the Python `xml.etree` module, and from there it's easy to make a dataframe. — Tim Roberts, May 31 '23 at 22:15
Thanks a lot. What I have there is a minimal example of what I was trying to demonstrate. My XML is more complex. Parsing with xml.etree is something I have considered (and tried, it actually works). But first I am trying to understand whether I am doing something wrong, or is this a bug or intended behavior of pandas. Are you saying what I am trying to do above is not possible directly with pandas.read_xml()? — Art Gertner, May 31 '23 at 22:18
If you check the documentation, pandas expects the XML to have rows with columns. In your first example, each `<ELEM>` is a row, and the `atr` is the column. In your second example, there are no columns. If you had `<ELEM><VAL>1</VAL></ELEM>`, it should work, because VAL would be the column. — Tim Roberts, May 31 '23 at 22:22
That explains it. Though it was not immediately obvious to me from reading the documentation. If you post this as an answer and quote+link the relevant section of the documentation, I will mark this one as solved. Thanks again. — Art Gertner, May 31 '23 at 22:28

score 1 · Accepted Answer · answered May 31 '23 at 23:41

If you check the documentation, pandas expects the XML to have rows with columns. In your first example, each <ELEM> is a row, and the atr is the column. In your second example, there are no columns. If you had <ELEM><VAL>1</VAL></ELEM>, it should work, because VAL would be the column.
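
As a rough illustration of the row/column shape described above, here is a minimal sketch with each value wrapped in a child element. The `VAL` name comes from the example in this answer. The sample is shortened to three rows, and `StringIO`, `parser='etree'` and the relative xpath `./ELEM` are assumptions made here so the snippet does not depend on lxml or on passing a literal XML string (deprecated in newer pandas versions); the question's `/ROOT/ELEM` with the default parser should select the same rows.

```python
from io import StringIO

import pandas

# Same structure as in the question, but every value sits in a child
# element <VAL>, so each <ELEM> row now has one column.
xml = '''
<ROOT>
  <ELEM><VAL>1</VAL></ELEM>
  <ELEM><VAL>2</VAL></ELEM>
  <ELEM><VAL>3</VAL></ELEM>
</ROOT>
'''

# parser='etree' uses the standard library parser; it only accepts
# relative paths, hence './ELEM' instead of '/ROOT/ELEM'.
df = pandas.read_xml(StringIO(xml), xpath='./ELEM', parser='etree')
print(df.to_string())
```

This should give a dataframe with a single `VAL` column, one row per `<ELEM>`.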

https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html
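
If the XML cannot be changed, the `xml.etree` route mentioned in the comments is one possible workaround. The following is only a sketch under assumptions: the column name `ELEM` and the `to_numeric` conversion are choices made for this example, not something `read_xml` prescribes, and the sample is shortened to three rows.

```python
import xml.etree.ElementTree as ET

import pandas

# Attribute-free elements with text only, as in the failing example.
xml = '''
<ROOT>
  <ELEM>1</ELEM>
  <ELEM>2</ELEM>
  <ELEM>3</ELEM>
</ROOT>
'''

root = ET.fromstring(xml)

# Collect the text of every <ELEM> directly under the root element.
values = [elem.text for elem in root.findall('./ELEM')]

# Build a one-column dataframe; the text arrives as strings, so convert
# it to numbers explicitly (skip this step if the values are not numeric).
df = pandas.DataFrame({'ELEM': values})
df['ELEM'] = pandas.to_numeric(df['ELEM'])
print(df.to_string())
```

For more complex documents, only the `findall` path should need to change to match the known location of the child elements.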
