I'm trying to parse an XML-like file (with no associated DTD) with pyparsing. Part of each record has the following contents:
- Something within <L> and </L> tags,
- One or more things within <pc> and </pc> tags,
- Optionally, something within <MW> and </MW> tags,
- Optionally, a literal <mul/>, and optionally a literal <mat/>.
The ordering of these elements varies.
So I wrote the following (I'm new to pyparsing; please point out if I'm doing something stupid):
#!/usr/bin/env python
from pyparsing import *

def DumbTagParser(tag):
    tag_close = '</%s>' % tag
    return Group(
        Literal('<') + Literal(tag).setResultsName('tag') + Literal('>')
        + SkipTo(tag_close).setResultsName('contents')
        + Literal(tag_close)
    ).setResultsName(tag)

record1 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') &\
          DumbTagParser('L') & \
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mat/>'))

record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') &\
          Optional(DumbTagParser('MW')) & \
          Optional(Literal('<mul/>')) & \
          DumbTagParser('L')

def attempt(s):
    print 'Attempting:', s
    match = record1.parseString(s, parseAll = True)
    print 'Match: ', match
    print

attempt('<L>1.1</L>')
attempt('<pc>Page1,1</pc> <pc>Page1,2</pc> <MW>000001</MW> <L>1.1</L>')
attempt('<mul/><MW>000003</MW><pc>1,1</pc><L>3.1</L>')
attempt('<mul/> <MW>000003</MW> <pc>1,1</pc> <L>3.1</L> ') # Note end space
Both parsers, record1 and record2, fail, with different exceptions. With record1, it fails on the last string (which differs from the penultimate string only in spaces):
pyparsing.ParseException: (at char 47), (line:1, col:48)
and with record2, it fails on the penultimate string itself:
pyparsing.ParseException: Missing one or more required elements (Group:({"<" "L" ">" SkipTo:("</L>") "</L>"})) (at char 0), (line:1, col:1)
Now what is weird is that if I interchange lines 2 and 3 in the definition of record2, then it parses fine!
record2 = Group(ZeroOrMore(DumbTagParser('pc'))).setResultsName('pcs') &\
          Optional(Literal('<mul/>')) & \
          Optional(DumbTagParser('MW')) & \
          DumbTagParser('L') # parses my example strings fine
(Yes, I realise that record2 doesn't contain any rule for <mat/>. I'm trying to get a minimal example that reflects this sensitivity to reordering.)
I'm not sure if this is a bug in pyparsing or in my code, but my real question is how I should parse the kind of strings I want.
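To make the target concrete, here is a rough sketch of the result wanted for the example strings above. It deliberately uses the standard-library re module rather than pyparsing, so it says nothing about the pyparsing behaviour in question. The name parse_record and the layout of the returned dict are hypothetical, chosen only for illustration, and the sketch does not reject duplicate <L> or <MW> elements.

```python
import re

# One element of a record: either a self-closing flag (<mul/> or <mat/>)
# or a tag pair (<L>, <pc>, <MW>) with its contents. Leading whitespace
# between elements is skipped.
_ELEMENT = re.compile(
    r'\s*(?:<(?P<flag>mul|mat)/>'
    r'|<(?P<tag>L|pc|MW)>(?P<body>.*?)</(?P=tag)>)'
)

def parse_record(s):
    """Collect the elements of one record, in whatever order they appear.

    Hypothetical helper for illustration; not part of pyparsing.
    """
    record = {'L': None, 'pcs': [], 'MW': None, 'mul': False, 'mat': False}
    text = s.rstrip()  # tolerate trailing whitespace
    pos = 0
    while pos < len(text):
        m = _ELEMENT.match(text, pos)
        if m is None:
            raise ValueError('unexpected input at char %d' % pos)
        if m.group('flag'):
            record[m.group('flag')] = True
        elif m.group('tag') == 'pc':
            record['pcs'].append(m.group('body'))
        else:
            record[m.group('tag')] = m.group('body')
        pos = m.end()
    if record['L'] is None:
        raise ValueError('missing required <L> element')
    return record
```

With this sketch, the last example string (the one with the trailing space) yields L = '3.1', pcs = ['1,1'], MW = '000003', mul = True and mat = False, which is the kind of structure I would like the pyparsing grammar to produce.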