pandas read_xml : column gets overwritten by elements with same tag (but different attributes)

Question

I'm trying to convert a xml file into a pandas dataframe with the read_xml function. The problem is that one of the nested elements is repeated several times (but with different attributes), and it seems that pandas creates only one column for this element that gets written over every time until it gets to the last one.

Here is the contents of the xml file I used:

<?xml version="1.0" encoding="UTF-8"?>
  <histdata>
   <item>
    <datetime>20-1-2023 00:00:00</datetime>
    <value channel="Traffic Total" chid="-1">92 kbit/s</value>
    <value channel="Traffic In" chid="0">77 kbit/s</value>
    <value channel="Traffic Out" chid="1">16 kbit/s</value>
    <value channel="Downtime" chid="-4">0 %</value>
    <coverage>100 %</coverage>
   </item>
   <item>
    <datetime>20-1-2023 00:05:00</datetime>
    <value channel="Traffic Total" chid="-1">82 kbit/s</value>
    <value channel="Traffic In" chid="0">727 kbit/s</value>
    <value channel="Traffic Out" chid="1">18 kbit/s</value>
    <value channel="Downtime" chid="-4">0 %</value>
    <coverage>100 %</coverage>
   </item>
  </histdata>

simply reading the xml file with read_xml with:

import pandas as pd
df = pd.read_xml('chdata.xml')
df

gives: enter image description here

As you can see, there is only one column for 'value' where I would expect one column for every value element in the item tag of the xml file (maybe with the attribute in the column name to distinct them from the others).

I know I can retrieve the values of these elements with the following code:

chdata = pd.read_xml('chdata.xml', xpath='//value')
chdata

This gives me a dataframe containing all value elements, as shown here: enter image description here

What I actually need is one dataframe that shows the information from both previous df's in one table. The one 'value' column from 'df' should be replaced by four columns containing the four 'value' lines from each item in the xml. One of the attributes (channel name or id) could be used as a column name.

Preferably the table should look something like this:

datetime	chid (or name) from 1st value	same from 2nd	3rd	4th	coverage

Can anyone help me to achieve this? If it's possible with just the tools provided by python and pandas that would be nice, but any solution is welcome.

Tranbi · Answer 1 · 2023-02-10T16:00:38.617

You can specify the objects you want to parse with xpath:

df = pd.read_xml('chdata.xml', xpath='//item/*')

will give you:

              datetime        channel  chid       value coverage
0   20-1-2023 00:00:00           None   NaN        None     None
1                 None  Traffic Total  -1.0   92 kbit/s     None
2                 None     Traffic In   0.0   77 kbit/s     None
3                 None    Traffic Out   1.0   16 kbit/s     None
4                 None       Downtime  -4.0         0 %     None
5                 None           None   NaN        None    100 %
6   20-1-2023 00:05:00           None   NaN        None     None
7                 None  Traffic Total  -1.0   82 kbit/s     None
8                 None     Traffic In   0.0  727 kbit/s     None
9                 None    Traffic Out   1.0   18 kbit/s     None
10                None       Downtime  -4.0         0 %     None
11                None           None   NaN        None    100 %

From there you can transform your df:

df['datetime'] = df['datetime'].ffill()
df['coverage'] = df.groupby('datetime')['coverage'].bfill()
df = (df.drop(columns='chid').dropna()
    .pivot(index=['datetime', 'coverage'], columns='channel', values='value')
    .reset_index(level='coverage'))

Output:

channel            coverage Downtime  Traffic In Traffic Out Traffic Total
datetime                                                                  
20-1-2023 00:00:00    100 %      0 %   77 kbit/s   16 kbit/s     92 kbit/s
20-1-2023 00:05:00    100 %      0 %  727 kbit/s   18 kbit/s     82 kbit/s

Thank you for your quick reply. First part worked immediately. For the next part I got a 'chid not found in axis'. I'll look into that later this weekend, but I assume I can figure this out. If this works, and I get the output you showed, that was exactly what I needed. Thanks mate! — Fey010, Feb 10 '23 at 16:08
Do you use the same data as in the question? If `chid` attribute is not available you can just remove `drop(columns='chid')` — Tranbi, Feb 10 '23 at 17:16
Nevermind. I cleared my notebook variables and it was working. Cheers! — Fey010, Feb 12 '23 at 16:10

score 1 · Accepted Answer · answered Feb 11 '23 at 06:27

Consider using the names argument per docs:

Column names for DataFrame of parsed XML data. Use this parameter to rename original element names and distinguish same named elements and attributes.

items_df = pd.read_xml(
    "Input.xml", 
    xpath=".//item",
    names = [
        "datetime", "value1", "value2", "value3", "value4", "coverage"
    ]
)

print(items_df)
#              datetime     value1      value2     value3 value4 coverage
# 0  20-1-2023 00:00:00  92 kbit/s   77 kbit/s  16 kbit/s    0 %    100 %
# 1  20-1-2023 00:05:00  82 kbit/s  727 kbit/s  18 kbit/s    0 %    100 %

To retrieve the further descendants, consider iterparse, normally reserved for very large XML files but offers an alternative to parsing XML outside of XPath.

attribs_df = pd.read_xml(
    "Input.xml", 
    iterparse = {
        "item": [
            "channel", "chid", "channel", "chid", 
            "channel", "chid", "channel", "chid"
        ]
    },
    names = [
        "channel1", "chid1", "channel2", "chid2",
        "channel3", "chid3", "channel4", "chid4"
    ]
)

Then join the two sets:

items_df = items_df.join(attribs_df)

print(items_df)
#              datetime     value1      value2     value3 value4 coverage       channel1  chid1    channel2  chid2     channel3  chid3  channel4  chid4
# 0  20-1-2023 00:00:00  92 kbit/s   77 kbit/s  16 kbit/s    0 %    100 %  Traffic Total     -1  Traffic In      0  Traffic Out      1  Downtime     -4
# 1  20-1-2023 00:05:00  82 kbit/s  727 kbit/s  18 kbit/s    0 %    100 %  Traffic Total     -1  Traffic In      0  Traffic Out      1  Downtime     -4

Note: All should be achievable with a single iterparse call by including datetime, value, and coverage but doing so repeats and affects order which may be a bug.

Very clean solution! The other answer I got also gets the job done, but this takes less steps, and is easier for me to understand. For now, I don't need to go down to further descendants, but I'll look into iterparse. Thanks again! — Fey010, Feb 12 '23 at 16:22
Thank you, too! The `iterparse` behavior on repeating elements may be a bug and currently a posted [pandas issue](https://github.com/pandas-dev/pandas/issues/51183)! — Parfait, Feb 12 '23 at 16:52

pandas read_xml : column gets overwritten by elements with same tag (but different attributes)

2 Answers2