
I'm trying to parse an XML-like file (with no associated DTD) with pyparsing. Part of each record has the following contents:

  • Something within <L> and </L> tags,
  • One or more things within <pc> and </pc> tags,
  • Optionally, something within <MW> and </MW> tags,
  • Optionally, a literal <mul/>, and optionally a literal <mat/>

The ordering of these elements varies.

So I wrote the following (I'm new to pyparsing; please point out if I'm doing something stupid):

#!/usr/bin/env python

from pyparsing import *

def DumbTagParser(tag):
    tag_close = '</%s>' % tag
    return Group(
             Literal('<') + Literal(tag).setResultsName('tag') + Literal('>')
           + SkipTo(tag_close).setResultsName('contents') 
           + Literal(tag_close)
           ).setResultsName(tag)

record1 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') &\
          DumbTagParser('L') & \
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mat/>')) 

record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') &\
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mul/>')) & \
          DumbTagParser('L') 

def attempt(s):
    print('Attempting: ' + s)
    match = record1.parseString(s, parseAll = True)
    print('Match: ' + str(match))
    print('')

attempt('<L>1.1</L>')
attempt('<pc>Page1,1</pc>  <pc>Page1,2</pc> <MW>000001</MW> <L>1.1</L>')
attempt('<mul/><MW>000003</MW><pc>1,1</pc><L>3.1</L>')
attempt('<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> ')  # Note end space

Both parsers record1 and record2 fail, with different exceptions. With record1, it fails on the last string (which differs from the penultimate string only in spaces):

pyparsing.ParseException:  (at char 47), (line:1, col:48)

and with record2, it fails on the penultimate string itself:

pyparsing.ParseException: Missing one or more required elements (Group:({"<" "L" ">" SkipTo:("</L>") "</L>"})) (at char 0), (line:1, col:1)

Now what is weird is that if I interchange lines 2 and 3 in the definition of record2, then it parses fine!

record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') &\
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          DumbTagParser('L')    # parses my example strings fine

(Yes I realise that record2 doesn't contain any rule for <mat/>. I'm trying to get a minimal example that reflects this sensitivity to reordering.)

I'm not sure if this is a bug in pyparsing or in my code, but my real question is how I should parse the kind of strings I want.

ShreevatsaR
  • Why the downvote? I tried hard to come up with a good question, and stripped down my code to a minimal example, and everything. :-) – ShreevatsaR Oct 31 '14 at 14:25

1 Answer


I don't know if you still want an answer but here is my bash...

I can see the following problems in your code:

  • You've assigned results names multiple times to multiple items. Since a dict could eventually be returned, you must either add '*' to each occurrence of the results name or drop it from some of the elements. I'll assume you are after the contents and not the tags, and drop the tags' names. FYI, the shortcut for parser.setResultsName(name) is parser(name).
  • Setting the results name to 'contents' for everything is also a bad idea, as we would lose information already available to us. Rather, name each CONTENTS by its corresponding TAG.
  • You are also making multiple items Optional within the ZeroOrMore. They are already 'optional' through the ZeroOrMore, so let's allow them as alternatives using the '^' operator, as there is no predefined sequence, i.e. pc tags could precede mul tags or vice versa. It seems reasonable to allow any combination and collect the items as we go.
  • As we also have to deal with multiples of a given tag, we append '*' to the CONTENTS' results name so that we can collect the results into lists.
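To make the point about the trailing '*' concrete, here is a minimal sketch. The grammar (plain alphabetic words) and the results name 'w' are made up for illustration and are not part of your parser; the only thing it is meant to show is how a plain results name keeps just the last match while a name ending in '*' keeps all of them.

```python
from pyparsing import OneOrMore, Word, alphas

word = Word(alphas)

# Plain results name: each new match overwrites the previous one.
last_only = OneOrMore(word('w')).parseString('alpha beta gamma')

# Results name ending in '*': every match is kept under the same name.
all_kept = OneOrMore(word('w*')).parseString('alpha beta gamma')

last_value = last_only['w']        # a single string
all_values = list(all_kept['w'])   # a list of strings

print(last_value)
print(all_values)
```

With the plain name only 'gamma' survives; with 'w*' all three words are available, which is the behaviour we want for repeated pc tags.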

First we create a function that builds a pair of opening and closing tags; your DumbTagParser is now called tagset:

from pyparsing import *  

def tagset(name, keywords=False):
    # Return [opening tag, closing tag] parsers for <name> ... </name>; both are suppressed.
    if keywords:
        return [Group(Literal('<') + Keyword(name) + Literal('>')).suppress(),
                Group(Literal('</') + Keyword(name) + Literal('>')).suppress()]
    else:
        return [Group(Literal('<') + Literal(name) + Literal('>')).suppress(),
                Group(Literal('</') + Literal(name) + Literal('>')).suppress()]

Next we create the parser which will parse <tag>CONTENT</tag>, where CONTENT is the content we are interested in, and return a dict of the form {'pc' : CONTENT, 'MW' : CONTENT, ...}:

tagDict = {name: tagset(name) for name in ['pc', 'MW', 'L', 'mul', 'mat']}

parser = None
for name, tags in tagDict.items():
    # The trailing '*' collects every match for a tag into a list.
    element = tags[0] + SkipTo(tags[1])(name + '*') + tags[1]
    parser = element if parser is None else parser ^ element

# Use exactly one of the following two definitions.

# If the literal <mul/> tag in your input is deliberate...
parser = Optional(Literal('<mul/>')) + ZeroOrMore(parser)

# If you wrote <mul/> by accident (meaning <mul>...</mul>), use this instead:
# parser = ZeroOrMore(parser)
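As a side note on the '^' operator used to combine the alternatives: it builds an Or, which tries every alternative and keeps the longest match, whereas '|' builds a MatchFirst, which stops at the first alternative that matches. The two literals below are made up purely to show the difference and have nothing to do with your tags.

```python
from pyparsing import Literal

short, long_ = Literal('ab'), Literal('abc')

# '|' (MatchFirst): the first alternative that matches wins.
first_match = (short | long_).parseString('abc')

# '^' (Or): all alternatives are tried and the longest match wins.
longest_match = (short ^ long_).parseString('abc')

first_tokens = list(first_match)
longest_tokens = list(longest_match)

print(first_tokens)
print(longest_tokens)
```

For the tag parsers here either operator would probably do, since no tag name is a prefix of another, but '^' avoids having to think about the order of the alternatives.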

and finally we test :

test = ['<L>1.1</L>',
 '<pc>Page1,1</pc>  <pc>Page1,2</pc> <MW>000001</MW> <L>1.1</L>',
 '<mul/><MW>000003</MW><pc>1,1</pc><L>3.1</L>',
 '<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> ']  

for item in test:
    print({key: list(val) for key, val in parser.parseString(item).asDict().items()})

which should produce, assuming you want a dict of lists :

{'L': ['1.1']}
{'pc': ['Page1,1', 'Page1,2'], 'MW': ['000001'], 'L': ['1.1']}
{'pc': ['1,1'], 'MW': ['000003'], 'L': ['3.1']}
{'pc': ['1,1'], 'MW': ['000003'], 'L': ['3.1']}
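
If you also want parseAll = True, as in your attempt function, and want <mul/> and <mat/> to be accepted anywhere in the record, a self-contained variation might look like the sketch below. The helper names element and parse_record are my own, and the sketch rests on two assumptions: that the standalone markers can simply be discarded, and that you don't need the grammar itself to enforce "exactly one <L>" (you could check that on the resulting dict instead).

```python
from pyparsing import Literal, SkipTo, Suppress, ZeroOrMore

def element(name):
    """Match <name>CONTENT</name>, keeping CONTENT under a list-valued results name."""
    close = Suppress('</' + name + '>')
    return Suppress('<' + name + '>') + SkipTo(close)(name + '*') + close

# Paired tags may appear in any order and any number of times.
paired = element('pc') ^ element('MW') ^ element('L')

# Assumption: the standalone markers carry no content, so they are dropped.
standalone = Suppress(Literal('<mul/>') | Literal('<mat/>'))

record = ZeroOrMore(paired | standalone)

def parse_record(text):
    """Parse a whole record and return a plain dict of lists."""
    result = record.parseString(text, parseAll=True)
    return {key: list(value) for key, value in result.asDict().items()}

print(parse_record('<L>1.1</L>'))
print(parse_record('<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> '))
```

Neither alternative inside the ZeroOrMore can match an empty string, so the loop always makes progress, and the trailing space in your last example is skipped before the end-of-string check.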
Carel