Extracting XML into data frame with parent attribute as column title

Question

I have thousands of XML files to process. They share a similar format, but have different parent names and different numbers of parents. Through books, Google, tutorials, and trial and error, I've been able to pull out all of the data. See, for example: Parsing xml to pandas data frame throws memory error and Dynamic search through xml attributes using lxml and xpath in python

However, I realized that I was extracting the data poorly, with a child "Time" repeated for each parent.

Here is what I am trying to get.

Time   blah   abc
1200   100   2
1300   30    4
1400   70    2

Here is what I know how to get, but my current method is clunky (I show it below, after the example XML):

    child      Time   grandchild
0     blah     1200    100
1     blah     1300    30
...
n-2   abc      1200    2
n-1   abc      1300    4
n     abc      1400    2

Example XML format

<outer>
   <inner>
      <parent name = "blah" id = "1"> 
         <child Time = "1200"> 
            <grandchild>100</grandchild>  
         </child>
         <child Time = "1300">
            <grandchild>30</grandchild>
         </child>
         <child Time = "1400">
            <grandchild>70</grandchild>
         </child>
      </parent>
      <parent name = "abc" id = "2"> 
         <child Time = "1200">   
            <grandchild>2</grandchild> 
         </child>
         <child Time = "1300">
            <grandchild>4</grandchild>
         </child>
         <child Time = "1400">
            <grandchild>2</grandchild>
         </child>
      </parent>      
      <parent name = "1234" id = "7734"> 
         <other> 12 </other>
      </parent> 
   </inner>
</outer>

Here is how I can get my output:

from lxml import etree
from pandas import DataFrame

# file_name is assumed to hold the path to one of the XML files
root = etree.parse(file_name)

dTime = []
dparent = []
dgrandchild = []
# Note: for the simplified example above, the matching path
# is '/*/*/parent/child' (one level shallower).
for child in root.xpath('/*/*/*/parent/child'):
    dparent.append(child.getparent().attrib['name'])
    # Iterate over the attributes of child (only Time here)
    for attrib in child.attrib:
        dTime.append(child.attrib[attrib])
        # grandchild is nested inside the child carrying Time; iterate over it
        for subfield in child.getchildren():
            dgrandchild.append(subfield.text)
df = DataFrame({'Parent': dparent, 'Time': dTime, 'grandchild': dgrandchild})

I could just take this output and re-shape it, but that seems inefficient and a very clunky approach.

I think I need something of the flavor:

# this does not work: it gives one row per child, not one row per Time
data = []
for elem in root.xpath('/*/*/*/parent/child'):
    elem_data = {}
    for attrib in elem.attrib:
        elem_data['Time'] = elem.attrib[attrib]
    for child in elem.getchildren():
        elem_data[elem.getparent().attrib['name']] = child.text
        data.append(elem_data)
ndata = DataFrame(data)

can u post some *valid* xml so that people can copy and paste? — Phillip Cloud, Jun 07 '13 at 21:44
snap @cpcloud. I tried to simplify it and left out the quotes on the attributes. I've edited it. Thanks. — jessi, Jun 09 '13 at 17:51

Andy Hayden · Accepted Answer · 2013-06-09T22:01:06.920

I recommend just parsing to a DataFrame first, similar to what you are already doing (see below for my implementation), and then reshaping it to your requirements.

Then you're looking for a pivot:

In [11]: df
Out[11]:
  child  Time  grandchild
0  blah  1200         100
1  blah  1300          30
2   abc  1200           2
3   abc  1300           4
4   abc  1400           2

In [12]: df.pivot(index='Time', columns='child', values='grandchild')
Out[12]:
child  abc  blah
Time
1200     2   100
1300     4    30
1400     2   NaN
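
The pivot step can be sketched as a self-contained script. The frame below reproduces the five rows shown in In [11]; the variable name wide is just an illustrative choice. The keyword form of pivot is used because recent pandas releases no longer accept positional arguments there.

```python
import pandas as pd

# Long-format frame: one row per (parent name, Time) pair, as in In [11].
df = pd.DataFrame(
    [("blah", 1200, 100), ("blah", 1300, 30),
     ("abc", 1200, 2), ("abc", 1300, 4), ("abc", 1400, 2)],
    columns=["child", "Time", "grandchild"],
)

# Time becomes the index, each parent name becomes a column,
# and the grandchild values fill the cells.
wide = df.pivot(index="Time", columns="child", values="grandchild")

# Missing combinations (here blah at 1400) come out as NaN.
print(wide)
```

If the same (Time, name) pair can occur more than once, pivot raises a ValueError; pivot_table with an aggregation function is the usual alternative in that case.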

I recommend first parsing the file and pulling the values you want into a list of tuples:

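from lxml import etree

# parse() returns an ElementTree; getroot() gives the <outer> element
root = etree.parse(file_name).getroot()

# root[0] is <inner>; its children are the <parent> elements
parents = list(root[0])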

In [21]: elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
                      for p in parents
                      for c in p
                      for gc in c]

In [22]: elems
Out[22]:
[('blah', 1200, 100),
 ('blah', 1300, 30),
 ('blah', 1400, 70),
 ('abc', 1200, 2),
 ('abc', 1300, 4),
 ('abc', 1400, 2)]
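
To try the comprehension without any files on disk, here is a minimal sketch that parses a trimmed copy of the example XML from a string. As an assumption to keep it dependency-free, it uses the standard library's xml.etree.ElementTree rather than lxml; element indexing and iteration behave the same way for this purpose. The name xml_text is illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the example XML from the question.
xml_text = """
<outer>
  <inner>
    <parent name="blah" id="1">
      <child Time="1200"><grandchild>100</grandchild></child>
      <child Time="1300"><grandchild>30</grandchild></child>
    </parent>
    <parent name="abc" id="2">
      <child Time="1200"><grandchild>2</grandchild></child>
    </parent>
    <parent name="1234" id="7734">
      <other> 12 </other>
    </parent>
  </inner>
</outer>
""".strip()

root = ET.fromstring(xml_text)   # root is the <outer> element
parents = list(root[0])          # children of <inner>

# The tuple is only built once a grandchild is reached, so the
# <other> element (no children, no Time attribute) contributes nothing.
elems = [(p.attrib["name"], int(c.attrib["Time"]), int(gc.text))
         for p in parents
         for c in p
         for gc in c]
print(elems)
```

The parent named "1234" drops out silently because its only child has no nested elements, so the innermost loop never runs for it.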

For multiple files you could just whack it all into an even longer list comprehension, which shouldn't be too slow unless you have a huge number of XML files (here files is the list of XML file paths):

elems = [(p.attrib['name'], int(c.attrib['Time']), int(gc.text))
            for f in files
            for p in etree.parse(f).getroot()[0]
            for c in p
            for gc in c]
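
If the comprehension gets hard to read, the same traversal can be written as a generator. This is only a sketch: iter_records is a hypothetical helper name, the two throwaway files hold made-up data, and the standard library's xml.etree.ElementTree stands in for lxml so the example runs without extra installs.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_records(paths):
    """Yield (name, Time, value) tuples from each XML file in paths."""
    for path in paths:
        root = ET.parse(path).getroot()      # the <outer> element
        for parent in root[0]:               # <parent> elements under <inner>
            for child in parent:
                time = child.attrib.get("Time")
                if time is None:             # skip children such as <other>
                    continue
                for gc in child:
                    yield parent.attrib["name"], int(time), int(gc.text)


# Demo with two tiny throwaway files (hypothetical data).
template = (
    '<outer><inner><parent name="{n}">'
    '<child Time="1200"><grandchild>{v}</grandchild></child>'
    '</parent></inner></outer>'
)
with tempfile.TemporaryDirectory() as tmp:
    files = []
    for n, v in [("blah", 100), ("abc", 2)]:
        path = os.path.join(tmp, n + ".xml")
        with open(path, "w") as fh:
            fh.write(template.format(n=n, v=v))
        files.append(path)
    elems = list(iter_records(files))

print(elems)
```

The generator only holds one parsed tree at a time, which may help when there are thousands of files.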

Put them in a DataFrame:

In [23]: pd.DataFrame(elems, columns=['child', 'Time', 'grandchild'])
Out[23]:
  child  Time grandchild
0  blah  1200        100
1  blah  1300         30
2  blah  1400         70
3   abc  1200          2
4   abc  1300          4
5   abc  1400          2

then do the pivot. :)

That works, @Andy Hayden. That was the approach that I was going to use if I could not get any good ideas for how to get it extracted that way. Seems like it is not clean and in keeping with the python approach to extract the data one way and then process it to reshape, ya know? — jessi, Jun 09 '13 at 17:49
Well, you can try and be clever, but usually it's faster (and saner) just to read it to a DataFrame "whatever" then reshape. Atm getting your data into a DataFrame at all looks like the slow bit (for loops and appends). — Andy Hayden, Jun 09 '13 at 18:10
Cool. I thought I was doing it too clunky. I appreciate your suggestion that reshape is not too bad. I'll go with that, then! Thanks — jessi, Jun 09 '13 at 18:39
Thanks, @Andy Hayden! I implemented it. If I do think of a cleaner way to do it (once I have more experience), I'll post it here. — jessi, Jun 09 '13 at 21:38
@Jessi in fact you can use a larger list comprehension for lots of files. I think it's pretty neat :) — Andy Hayden, Jun 09 '13 at 22:02
