
I need to parse a file containing XML comments. Specifically, it's a C# file using the Microsoft /// convention.

From the sample below I need to pull out foobar; /// foobar would be acceptable, too. (Note: this still doesn't work if you put all the XML on one line.)

testStr = """
    ///<summary>
    /// foobar
    ///</summary>
    """

Here is what I have:

import pyparsing as pp

_eol = pp.Literal("\n").suppress()
_cPoundOpenXmlComment = pp.Suppress('///<summary>') + pp.SkipTo(_eol)
_cPoundCloseXmlComment = pp.Suppress('///</summary>') + pp.SkipTo(_eol)
_xmlCommentTxt = ~_cPoundCloseXmlComment + pp.SkipTo(_eol)
xmlComment = _cPoundOpenXmlComment + pp.OneOrMore(_xmlCommentTxt) + _cPoundCloseXmlComment

match = xmlComment.scanString(testStr)

and to output:

for item,start,stop in match:
    for entry in item:
        print(entry)

But I haven't had much success getting the grammar to work across multiple lines.

(Note: I tested the sample above in Python 3.2; it runs but, per my issue, does not print any values.)

Thanks!

some bits flipped

3 Answers


I think Literal('\n') is your problem. You don't want to build a Literal with whitespace characters (since Literals by default skip over whitespace before trying to match). Try using LineEnd() instead.

EDIT 1: Just because you get an infinite loop with LineEnd doesn't mean that Literal('\n') is any better. Try adding .setDebug() on the end of your _eol definition, and you'll see that it never matches anything.
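The whitespace-skipping behaviour can be checked directly. This is a minimal sketch, assuming pyparsing is installed; the sample string and the variable names are made up for illustration.

```python
import pyparsing as pp

# Hypothetical two-line sample, used only for this demonstration.
sample = "first line\nsecond line\n"

# Literal skips leading whitespace (newlines included) before it tries to
# match, so the newline is already gone by the time it looks for "\n".
literal_hits = pp.Literal("\n").searchString(sample).asList()

# LineEnd only treats spaces and tabs as skippable, so it can see the newline.
line_end_hits = pp.LineEnd().searchString(sample).asList()

print("Literal matches:", literal_hits)
print("LineEnd matches:", line_end_hits)
```

The Literal search should come back empty, while the LineEnd search should report the newlines.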

Instead of trying to define the body of your comment as "one or more lines that are not a closing line, but get everything up to the end-of-line", what if you just do:

xmlComment = _cPoundOpenXmlComment + pp.SkipTo(_cPoundCloseXmlComment) + _cPoundCloseXmlComment 

(The reason you were getting an infinite loop with LineEnd() was that you were essentially doing OneOrMore(SkipTo(LineEnd())), but never consuming the LineEnd(), so the OneOrMore just kept matching and matching and matching, parsing and returning an empty string since the parsing position was at the end of line.)
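Put together with the test string from the question, the SkipTo suggestion might look like the sketch below. It assumes pyparsing is installed; the names open_tag, close_tag, and comment_text are introduced here for illustration and are not part of pyparsing or of the original code.

```python
import pyparsing as pp

test_str = """
    ///<summary>
    /// foobar
    ///</summary>
    """

open_tag = pp.Suppress("///<summary>")
close_tag = pp.Suppress("///</summary>")

# Grab everything between the opening and closing markers in one step,
# instead of matching the body line by line.
xml_comment = open_tag + pp.SkipTo(close_tag) + close_tag


def comment_text(raw):
    """Hypothetical helper: drop the /// prefixes and surrounding whitespace."""
    lines = (line.strip() for line in raw.splitlines())
    return " ".join(line.lstrip("/").strip() for line in lines if line)


summaries = [comment_text(tokens[0])
             for tokens, start, end in xml_comment.scanString(test_str)]
print(summaries)
```

Because SkipTo does not care about line boundaries, the same grammar should also cope with a comment written on a single line.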

PaulMcG
  • Thanks for the suggestion; however, changing to `_eol=pp.LineEnd().suppress()` results in a hang (infinite loop). Could you be a little more specific? (Note: paste the 3 sections together in one .py file and the code runs as-is.) Thanks, Mike – some bits flipped Oct 19 '11 at 19:47
  • vote up for the explanation of what is wrong. Duh! I should have seen that I never consumed the end of line :) – some bits flipped Oct 22 '11 at 20:39

How about using nestedExpr:

import pyparsing as pp

text = '''\
///<summary>
/// foobar
///</summary>
blah blah
///<summary> /// bar ///</summary>
///<summary>  ///<summary> /// baz  ///</summary> ///</summary>    
'''

comment=pp.nestedExpr("///<summary>","///</summary>")
for match in comment.searchString(text):
    print(match)
    # [['///', 'foobar']]
    # [['///', 'bar']]
    # [[['///', 'baz']]]
unutbu

You could use an xml parser to parse xml. It should be easy to extract relevant comment lines:

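import re
from xml.etree import ElementTree as etree  # cElementTree was removed in Python 3.9

# sample input, matching the question's test string
text = """
    ///<summary>
    /// foobar
    ///</summary>
    """

# extract the content of every /// line
lines = re.findall(r'^\s*///(.*)', text, re.MULTILINE)

# wrap the fragments in a single root element and parse them as XML
root = etree.fromstring('<root>%s</root>' % ''.join(lines))
print(root.findtext('summary').strip())
# -> foobar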
jfs
  • I thought you were great in Blade Runner. – PaulMcG Oct 20 '11 at 01:47
  • @JFSebastian Unfortunately this wouldn't work in the bigger picture where I'm encountering this problem. Yes, I could extract all the XML fragments as you suggest, but I also need to parse the source code after the comment, and a grammar is ~necessary for that; doing the regex search line by line would add an additional loop through the file. – some bits flipped Oct 22 '11 at 20:29
  • @mike: the regex is just an example of how to extract comment lines. In the bigger picture, you use your parser to extract the relevant comments (a much simpler task than parsing XML), and that doesn't prevent you from using an XML parser to parse the XML if you find it necessary. – jfs Oct 22 '11 at 20:48