
I am trying to put (1) a parent attribute, (2) a child attribute, and (3) grandchild text into a data frame. I can get the child attribute and the grandchild text to print to the screen, but I cannot get them into a data frame; pandas raises a MemoryError.

Here is the setup:

import requests
from lxml import etree, objectify
r = requests.get('https://api.stuff.us/place/getData?security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel') #edited for privacy
root = etree.fromstring(r.text)
xml_new = etree.tostring(root, pretty_print=True)
print xml_new[300:900] #gives xml output to show structure
<startTime>2013-05-01 00:00:00</startTime>
<endTime>2013-05-01 23:59:00</endTime>
<summaryPeriod>minutes</summaryPeriod>
<data>
  <channel channel="97925" name="blah"> 
    <Time Time="2013-05-01 00:00:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:01:00">
      <value>259</value>
    </Time>
    <Time Time="2013-05-01 00:02:00">
      <value>258</value>
    </Time>
    <Time Time="2013-05-01 00:03:00">
      <value>257</value>
    </Time>

This shows how I am parsing to print the child attribute and the grandchild text:

for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        print '@' + attrib + '=' + df.attrib[attrib]
    ## value is a child of Time; iterate over it
    subfields = df.getchildren()
    for subfield in subfields:
        print subfield.tag + '=' + subfield.text

It yields a very long print out with the information as requested:

...
@Time=2013-05-01 23:01:00
value=100
@Time=2013-05-01 23:02:00
value=101
@Time=2013-05-01 23:03:00
value=99
@Time=2013-05-01 23:04:00
value=101
...

However, when I try to put it into a data frame, I get a memory error. I tried with both of them and also with just the child attribute.

data = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    el_data = {}
    for attrib in df.attrib:
        el_data[attrib] = df.attrib[attrib]
    data.append(el_data)
from pandas import *
perf = DataFrame(data)
perf

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-6-08c8c74f7192> in <module>()
      1 from pandas import *
----> 2 perf = DataFrame(data)
      3 perf

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    417 
    418                 if isinstance(data[0], (list, tuple, collections.Mapping, Series)):
--> 419                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    420                     columns = _ensure_index(columns)
    421 

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   5457         return _list_of_dict_to_arrays(data, columns,
   5458                                        coerce_float=coerce_float,
-> 5459                                        dtype=dtype)
   5460     elif isinstance(data[0], Series):
   5461         return _list_of_series_to_arrays(data, columns,

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _list_of_dict_to_arrays(data, columns, coerce_float, dtype)
   5521             for d in data]
   5522 
-> 5523     content = list(lib.dicts_to_array(data, list(columns)).T)
   5524     return _convert_object_array(content, columns, dtype=dtype,
   5525                                  coerce_float=coerce_float)

/Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.dicts_to_array (pandas/lib.c:7657)()

MemoryError: 

I have 12960 values of "value" in my xml file. I assume the error is telling me something about the values in the file not being what pandas expects, but that doesn't square with a memory error, and I could not figure it out from other SO questions about memory errors or from the pandas documentation.

An attempt to get the data types yields no information. Maybe there are no types, perhaps because they are elements in an element tree? (I tried to print .pyval, but it only told me there was no such attribute.) el_data is of type "dict".

print(objectify.dump(root))[700:1000] #print a subset of types
name = 'zone'
            Time = None [_Element]
              * Time = '2013-05-01 00:00:00'
                value = '258' [_Element]
            Time = None [_Element]
              * Time = '2013-05-01 00:01:00'
                value = '259' [_Element]
type(el_data)
dict
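
Per Jeff's suggestion in the comments below, a quick way to see what pandas actually receives is to inspect the first dict and the concrete types of its key and value (a minimal check, assuming data was built by the loop above):

print data[0], type(data[0])
## prints like {'Time': '2013-05-01 00:00:00'}, but the key and value
## may be lxml string-like objects rather than plain str
k = data[0].keys()[0]
print type(k), type(data[0][k])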

I built this code based on the book Python for Data Analysis and other examples found on SO for parsing XML. I am still new to Python.

Running Python 2.7.2 on Mac OS 10.7.5

  • Seeing as how you're using a 32-bit build, the result (or the intermediary list of dicts) may simply be larger than 2GB. – Joe Kington Jun 04 '13 at 16:06
  • Here's another question like this: http://stackoverflow.com/questions/10947968/xml-to-pandas-dataframe. You also need to make sure you have base dtypes; e.g. if they are strings, then use ``str(attribute)``, as the etree values are not strings (though they print like strings), so you are getting a huge object (and not a string). – Jeff Jun 04 '13 at 16:09
  • Please show the first couple of elements of your data list, and their type, e.g. ``data[0], type(data[0])``.... – Jeff Jun 04 '13 at 16:23
  • Jeff, I think that is the problem. These are not strings or values; rather, they are "elements" (edited question to show). – jessi Jun 04 '13 at 16:41
  • @JoeKington how can I test that? – jessi Jun 04 '13 at 16:51
  • Just ``str(element)`` should work (as you are creating the dict). – Jeff Jun 04 '13 at 17:00
  • Well, given what @Jeff mentioned, the reason it's going over 2GB is that each key in your dict is a unique object (e.g. one "Time" key is not the same as the next "Time" key because each one is in a separate dict). This means that pandas is trying to build a 12960x12960 DataFrame instead of a 12960x2 DataFrame. – Joe Kington Jun 04 '13 at 17:11
  • Thanks -> I solved it by putting each into lists first and then compiling the data frame. Thanks for the link to the other question. I appreciate the help! – jessi Jun 04 '13 at 17:16
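
For reference, here is a minimal sketch of the str() coercion Jeff suggests in the thread above. It keeps the one-dict-per-row structure but forces plain string keys and values, so every row shares the same column names (a sketch under those assumptions, not tested against the original feed):

from pandas import DataFrame
data = []
for el in root.xpath('//channel/Time'):
    row = {}
    ## force plain str so pandas sees the same 'Time' column on every row
    for attrib in el.attrib:
        row[str(attrib)] = str(el.attrib[attrib])
    ## grandchild text, e.g. row['value'] = '258'
    for subfield in el.getchildren():
        row[subfield.tag] = subfield.text
    data.append(row)
perf = DataFrame(data)  # should come out as 12960 rows x 2 columns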

1 Answer


Answer based on help from Jeff and Joe Kington. The data needed to be put into separate lists before being pushed into the data frame. The memory error was caused by filling the dicts with lxml "elements" rather than plain values, which pandas could not line up into columns. Instead, each attribute value and each text value can be appended to its own list, and the DataFrame built from those lists.

This works:

dTime = []
dvalue = []
for df in root.xpath('//channel/Time'):
    ## Iterate over attributes of channel/Time
    for attrib in df.attrib:
        dTime.append(df.attrib[attrib])
    ## value is a child of Time; iterate over it
    subfields = df.getchildren()
    for subfield in subfields:
        dvalue.append(subfield.text)
pef = DataFrame({'Time': dTime, 'value': dvalue})

pef

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12960 entries, 0 to 12959
Data columns (total 2 columns):
Time     12960  non-null values
value    12960  non-null values
dtypes: object(2) 

pef[:5]

    Time                    value
0    2013-05-01 00:00:00    258
1    2013-05-01 00:01:00    259
2    2013-05-01 00:02:00    258
3    2013-05-01 00:03:00    257
4    2013-05-01 00:04:00    257
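
The question also asked for the parent attribute. Since each Time element's parent is its channel element, the channel attribute is reachable through lxml's getparent(). Here is a hedged extension of the same list-based approach (the column names are my own choice):

from pandas import DataFrame
dchannel = []
dTime = []
dvalue = []
for t in root.xpath('//channel/Time'):
    dchannel.append(t.getparent().get('channel'))  # parent attribute, e.g. '97925'
    dTime.append(t.get('Time'))                    # child attribute
    dvalue.append(t.findtext('value'))             # grandchild text
pef = DataFrame({'channel': dchannel, 'Time': dTime, 'value': dvalue})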
  • Glad you solved your problem! (and +1) Also, just so you know, you can (and should) mark your answer as the "accepted" answer. – Joe Kington Jun 04 '13 at 18:00
  • It won't let me accept for two days. I'll do it then. Thanks, @JoeKington – jessi Jun 04 '13 at 19:40