0

I have been exploring an XML-based API for work; it returns warehouse data. Ideally I want to do some analysis in Python with pandas.

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) , None,  StringAggregateDimensionValue(value=u'VIRTUALLY_LABELED_CASE') ], quantity=127) ,  

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) ,  StringAggregateDimensionValue(value=u'PPTransMergeNonCon') ,  StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=15)   

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 27, 21, 0, tzinfo= UTC )) ,  StringAggregateDimensionValue(value=u'PPTransFRA1') ,  StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=8) ,  

The data looks like the stream above, after I did some find-and-replace in Vim (I know I could just script that in Python). How do I best get this weird format into pandas? Ideally I want the datetime, the string aggregate-dimension values, and the quantity, but there are a lot of Nones scattered through this data that still needs parsing. Once it's in a DataFrame the analysis will be easy, but I'm a bit stumped here (and feel a lot like a n00b).
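For lines in the cleaned-up form shown above, a couple of regexes are enough to pull out the fields and build a DataFrame directly, without a full parser. A minimal sketch (the hard-coded sample, column names, and padding of malformed rows are my own choices, not from the API):

```python
import re
import datetime
import pandas as pd

# two of the cleaned-up lines from above, hard-coded so the sketch runs standalone
sample = """\
Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) , None,  StringAggregateDimensionValue(value=u'VIRTUALLY_LABELED_CASE') ], quantity=127) ,
Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) ,  StringAggregateDimensionValue(value=u'PPTransMergeNonCon') ,  StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=15)
"""

# one match per Aggregate: the dimension list and the quantity
agg_re = re.compile(
    r"Aggregate\(aggregate_dimension_value_list=\[(.*?)\]\s*,\s*quantity=(\d+)\)",
    re.DOTALL)
# the timestamp fields inside the dimension list
date_re = re.compile(r"datetime\.datetime\((\d+), (\d+), (\d+), (\d+), (\d+)")
# the two string slots: either the literal None or a u'...' value
item_re = re.compile(r"None|u'([^']*)'")

records = []
for dims, qty in agg_re.findall(sample):
    ts = datetime.datetime(*map(int, date_re.search(dims).groups()))
    items = [None if m.group(0) == "None" else m.group(1)
             for m in item_re.finditer(dims)]
    items += [None] * (2 - len(items))   # pad rows that dropped a dimension
    records.append((ts, items[0], items[1], int(qty)))

df = pd.DataFrame(records, columns=["timestamp", "s1", "s2", "quantity"])
```

The Nones survive as missing values in the `s1`/`s2` columns, so `df.groupby` and friends handle them the usual pandas way.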

EDIT: Here is the raw data, before any regexing and replacing, that I actually get and want to parse. It isn't really XML, so an XML parser doesn't work.

[<DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 20, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransCGN1')>, <
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=992)>, <
StringAggregateDimensionValue(value=u'PPTransLEJ1')>, <StringAggregateDimensionValue(
value=u'PRIME_BIN_RANDOM_STOW')>], quantity=945)>, <Aggregate(
aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.datetime(2013
, 8, 23, 19, 30, tzinfo=<UTC>))>, None, <StringAggregateDimensionValue(value=u'TOTE')>],
 quantity=87)>, <Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(
value=datetime.datetime(2013, 8, 27, 17, 30, tzinfo=<UTC>))>, <
StringAggregateDimensionValue(value=u'PPTransMUC3')>, <StringAggregateDimensionValue(
value=u'TOTE')>], quantity=14)>, <Aggregate(aggregate_dimension_value_list=[<
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 27, 20, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransEUK5')>, <
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=339)>, <
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.
datetime(2013, 8, 26, 20, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u
'PPTransCGN1')>, <StringAggregateDimensionValue(value=u'TOTE')>], quantity=1731)>, <
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.
datetime(2013, 8, 26, 19, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u
'PPTransEUK5')>, quantity=444)>, <Aggregate(aggregate_dimension_value_list=[<
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransEUK5')>, <
StringAggregateDimensionValue(value=u'TOTE')>], quantity=28)>, <Aggregate(
aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.datetime(2013
, 8, 28, 19, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u'PPTransORY1')>,
 <StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=69)>, <
Aggregate(aggregate_dimension_value_list=<Aggregate(aggregate_dimension_value_list=[<
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransMAD4')>, <
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=47)>, <
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.
datetime(2013, 8, 26, 21, 0, tzinfo=<UTC>))>, None, None], quantity=78)>
Peadar Coyle
  • Did it start off as actual XML? Why not use an XML parser, like in this question: http://stackoverflow.com/a/16993660/1240268 – Andy Hayden Aug 26 '13 at 09:10
  • It isn't actually XML, it is 'like XML' but a bit more complicated than that. When I try parsing it like XML, I get errors... – Peadar Coyle Aug 26 '13 at 09:42
  • I recommend looking at the errors, parsing the xml(-like) would be preferable (faster and safer). You could hack it by defining these functions (which are mostly the identity or tuple), then evaling (but be careful if this data is not from a trusted source) and using from_records. – Andy Hayden Aug 26 '13 at 10:42

2 Answers

1

If you would prefer something more along the lines of a parser, here is a pyparsing stab at your problem:

from pyparsing import Suppress,QuotedString,Word,alphas,nums,alphanums,Keyword,Optional
import datetime

# define UTC timezone for sake of eval
if hasattr(datetime,"timezone"):
    UTC = datetime.timezone(datetime.timedelta(0),"UTC")
else:
    UTC = None

_ = Suppress
evaltokens = lambda s,l,t: eval(''.join(t))

timevalue = 'datetime.datetime' + QuotedString('(', endQuoteChar=')', unquoteResults=False)
timevalue.setParseAction(evaltokens)

strvalue = 'u' + QuotedString("'", unquoteResults=False)
strvalue.setParseAction(evaltokens)

nonevalue = Keyword("None").setParseAction(lambda s,l,t: [None])
intvalue = Word(nums).setParseAction(lambda s,l,t: int(t[0]))

COMMA = Optional(_(","))

valuedexpr = lambda expr: (Word(alphas) + "(" + "value" + "=" + expr + ")").setParseAction(lambda t: t[4])

lineexpr = (_("Aggregate(aggregate_dimension_value_list=[") +
            valuedexpr(timevalue)("timestamp") + COMMA +
            (nonevalue | valuedexpr(strvalue))("s1") + COMMA +
            (nonevalue | valuedexpr(strvalue))("s2") + COMMA +
        "]" + COMMA +
        "quantity=" + intvalue("qty"))

Use lineexpr.searchString to pull the data out of each Aggregate:

for data in lineexpr.searchString(sample):
    print data.dump()
    print data.qty
    print

giving:

[datetime.datetime(2013, 8, 28, 19, 30), None, u'VIRTUALLY_LABELED_CASE', ']', 'quantity=', 127]
- qty: 127
- s1: None
- s2: VIRTUALLY_LABELED_CASE
- timestamp: 2013-08-28 19:30:00
127

[datetime.datetime(2013, 8, 28, 19, 30), u'PPTransMergeNonCon', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 15]
- qty: 15
- s1: PPTransMergeNonCon
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-28 19:30:00
15

[datetime.datetime(2013, 8, 27, 21, 0), u'PPTransFRA1', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 8]
- qty: 8
- s1: PPTransFRA1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-27 21:00:00
8

dump() shows all the named results that are available to you - note how the quantity attribute can be accessed directly as data.qty. That was set up by the results name "qty" in "quantity=" + intvalue("qty"); timestamp, s1, and s2 can be accessed the same way. (There is still a little evaling in this; cleaning that up is left as an exercise for the reader.)
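Those named results map straight onto DataFrame columns. A sketch of the pandas step, with the rows hard-coded to match the dump() output above so it runs without the parser; in the real pipeline you would build them from lineexpr.searchString(sample) as the comment shows:

```python
import datetime
import pandas as pd

# rows as they come out of the parser above -- hard-coded here so this
# sketch stands alone; normally you would build them with something like
#   rows = [(d.timestamp, d.s1, d.s2, d.qty) for d in lineexpr.searchString(sample)]
rows = [
    (datetime.datetime(2013, 8, 28, 19, 30), None, "VIRTUALLY_LABELED_CASE", 127),
    (datetime.datetime(2013, 8, 28, 19, 30), "PPTransMergeNonCon", "PRIME_BIN_RANDOM_STOW", 15),
    (datetime.datetime(2013, 8, 27, 21, 0), "PPTransFRA1", "PRIME_BIN_RANDOM_STOW", 8),
]
df = pd.DataFrame.from_records(rows, columns=["timestamp", "s1", "s2", "quantity"])

# typical analysis once it is a DataFrame: total quantity per half-hour bucket
totals = df.groupby("timestamp")["quantity"].sum()
```

from_records is exactly the route Andy Hayden suggested in the comments; once the parser has done its job, pandas needs nothing else.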

EDIT:

Here is the modified pyparsing parser, updated to process your original raw XML-like text. The changes were really pretty minor:

from pyparsing import Suppress,QuotedString,Word,alphas,nums,alphanums,Keyword,Optional, ungroup
import datetime

# define UTC timezone for sake of eval
if hasattr(datetime,"timezone"):
    UTC = datetime.timezone(datetime.timedelta(0),"UTC")
else:
    UTC = None

_ = Suppress
evaltokens = lambda s,l,t: eval(''.join(t))

timevalue = 'datetime.datetime' + QuotedString('(', endQuoteChar=')', unquoteResults=False)
replUTC = lambda s,l,t: ''.join(t).replace("< UTC>","UTC").replace("<UTC >","UTC").replace("<UTC>","UTC")
timevalue.setParseAction(replUTC, evaltokens)

strvalue = 'u' + QuotedString("'", unquoteResults=False)
strvalue.setParseAction(evaltokens)

nonevalue = Keyword("None").setParseAction(lambda s,l,t: [None])
intvalue = Word(nums).setParseAction(lambda s,l,t: int(t[0]))

COMMA = Optional(_(","))
LT,GT,LPAR,RPAR,LBRACK,RBRACK = map(Suppress,"<>()[]")

#~ valuedexpr = lambda expr: (Word(alphas) + "(" + "value" + "=" + expr + ")").setParseAction(lambda t: t[4])
valuedexpr = lambda expr: ungroup(LT + (Word(alphas) + "(" + "value" + "=" + expr("value") + ")" + GT).setParseAction(lambda t: t.value))

#~ lineexpr = (_("Aggregate(aggregate_dimension_value_list=[") +
            #~ valuedexpr(timevalue)("timestamp") + COMMA +
            #~ (nonevalue | valuedexpr(strvalue))("s1") + COMMA +
            #~ (nonevalue | valuedexpr(strvalue))("s2") + COMMA +
        #~ "]" + COMMA +
        #~ "quantity=" + intvalue("qty"))

lineexpr = (LT + "Aggregate" + LPAR + "aggregate_dimension_value_list" + "=" + LBRACK +
            valuedexpr(timevalue)("timestamp") + COMMA +
            (nonevalue | valuedexpr(strvalue))("s1") + COMMA +
            (nonevalue | valuedexpr(strvalue))("s2") + 
        RBRACK + COMMA +
        "quantity=" + intvalue("qty") + RPAR + GT)

Run against your pasted text (some of which was malformed), this gives:

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 992]
- qty: 992
- s1: PPTransCGN1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-26 20:30:00
992

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 23, 19, 30), None, u'TOTE', 'quantity=', 87]
- qty: 87
- s1: None
- s2: TOTE
- timestamp: 2013-08-23 19:30:00
87

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 17, 30), u'PPTransMUC3', u'TOTE', 'quantity=', 14]
- qty: 14
- s1: PPTransMUC3
- s2: TOTE
- timestamp: 2013-08-27 17:30:00
14

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 20, 30), u'PPTransEUK5', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 339]
- qty: 339
- s1: PPTransEUK5
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-27 20:30:00
339

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'TOTE', 'quantity=', 1731]
- qty: 1731
- s1: PPTransCGN1
- s2: TOTE
- timestamp: 2013-08-26 20:30:00
1731

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransEUK5', u'TOTE', 'quantity=', 28]
- qty: 28
- s1: PPTransEUK5
- s2: TOTE
- timestamp: 2013-08-26 19:30:00
28

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 28, 19, 30), u'PPTransORY1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 69]
- qty: 69
- s1: PPTransORY1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-28 19:30:00
69

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransMAD4', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 47]
- qty: 47
- s1: PPTransMAD4
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-26 19:30:00
47

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 21, 0), None, None, 'quantity=', 78]
- qty: 78
- s1: None
- s2: None
- timestamp: 2013-08-26 21:00:00
78
PaulMcG
  • I don't seem able to replicate this tip. What version of pyparsing are you using? I'm using 1.5.7. I get no error message though... just a blank file when I print my results to a text file... – Peadar Coyle Aug 26 '13 at 13:58
  • Pyparsing 2.0.1, Python 3.3 - but there is nothing special in this parser that is version specific, this should run just fine with 1.5.7. (If you are using Python 2.6 or 2.7, you can safely upgrade to Pyparsing 2.0.1.) – PaulMcG Aug 26 '13 at 17:48
  • Any suggestions, Paul, for making this more object-oriented? I've tried a bit but I keep running into blockers. – Peadar Coyle Sep 24 '13 at 16:04
0

You can define minimal classes for Aggregate, DateAggregateDimensionValue, and StringAggregateDimensionValue, then eval each line in turn:

import datetime

# define UTC timezone for sake of eval
if hasattr(datetime,"timezone"):
    UTC = datetime.timezone(datetime.timedelta(0),"UTC")
else:
    UTC = None

# define minimal classes to eval initializers
class AggregateDimensionValue(object):
    def __init__(self, value):
        self.value = value
class StringAggregateDimensionValue(AggregateDimensionValue): pass
class DateAggregateDimensionValue(AggregateDimensionValue): pass
class Aggregate(object):
    def __init__(self, aggregate_dimension_value_list, quantity):
        self.timestamp, self.s1, self.s2 = aggregate_dimension_value_list
        # pull values out of parsed "aggregate" instances
        self.timestamp = self.timestamp.value
        if self.s1 is not None:
            self.s1 = self.s1.value
        if self.s2 is not None:
            self.s2 = self.s2.value
        self.quantity = quantity

Use these minimal classes to eval the input strings:

for line in sample.splitlines():
    if not line.strip():
        continue
    obj = eval(line.strip(' ,'))
    print obj.__dict__

Gives:

{'timestamp': datetime.datetime(2013, 8, 28, 19, 30), 's1': None, 'quantity': 127, 's2': u'VIRTUALLY_LABELED_CASE'}
{'timestamp': datetime.datetime(2013, 8, 28, 19, 30), 's1': u'PPTransMergeNonCon', 'quantity': 15, 's2': u'PRIME_BIN_RANDOM_STOW'}
{'timestamp': datetime.datetime(2013, 8, 27, 21, 0), 's1': u'PPTransFRA1', 'quantity': 8, 's2': u'PRIME_BIN_RANDOM_STOW'}

Of course, this comes with all the usual caveats about eval, such as the possibility of injecting malicious code. But I suspect you are already in control of this input file yourself, so if you inject your own malicious code, you have only yourself to blame.
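If the eval caveat bothers you, one mitigation (my suggestion, not part of the answer above) is to eval with an explicit, near-empty namespace so only the names you whitelist are resolvable. A sketch, reusing the minimal classes from the answer and standing in datetime.timezone.utc for the API's UTC object:

```python
import datetime

UTC = datetime.timezone.utc  # stands in for the API's <UTC> tzinfo

class AggregateDimensionValue(object):
    def __init__(self, value):
        self.value = value

class StringAggregateDimensionValue(AggregateDimensionValue): pass
class DateAggregateDimensionValue(AggregateDimensionValue): pass

class Aggregate(object):
    def __init__(self, aggregate_dimension_value_list, quantity):
        ts, s1, s2 = aggregate_dimension_value_list
        self.timestamp = ts.value
        self.s1 = s1.value if s1 is not None else None
        self.s2 = s2.value if s2 is not None else None
        self.quantity = quantity

# only these names are visible inside eval; builtins are cut off entirely
namespace = {"__builtins__": {},
             "datetime": datetime, "UTC": UTC,
             "Aggregate": Aggregate,
             "StringAggregateDimensionValue": StringAggregateDimensionValue,
             "DateAggregateDimensionValue": DateAggregateDimensionValue}

line = ("Aggregate(aggregate_dimension_value_list=["
        "DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo=UTC)), "
        "None, StringAggregateDimensionValue(value=u'VIRTUALLY_LABELED_CASE')], quantity=127)")
obj = eval(line, namespace)
```

This blocks casual use of builtins, but eval of untrusted input is never fully safe; for genuinely hostile data the pyparsing answer above is the better route.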

PaulMcG