I'm trying to use pyparsing to parse quoted strings under the following conditions:

  • The quoted string might contain internal quotes.
  • I want to use backslashes to escape internal quotes.
  • The quoted string might end with a backslash.

I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).

Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?

Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):

import pyparsing as pp
import re

# A single-quoted string having:
#   - Internal escaped quote.
#   - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"

# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks   = parser.parseString(txt)
print
print 'txt:    ', txt
print 'pattern:', parser.pattern
print 'toks:   ', toks

# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)

Output:

txt:     'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks:    ["ab'"]

'ab\'cd\'

Update

Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.

Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\

# demo.txt
foo = 'ab\'cd\\'

My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because the parsed result ends with two backslashes rather than just one.

with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()

parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks   = parser.parseString(txt)
print
print 'txt:    ', txt
print 'pattern:', parser.pattern
print 'toks:   ', toks             # ["ab'cd\\\\"]

I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.

Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:

qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
FMc
  • How do you expect your example text to be parsed? It seems like an ill-formed quoted string to me, since there's only one unescaped quotation mark. – Blckknght Apr 26 '14 at 02:36
  • I'm not sure that is possible (or at least practical) in general. How do you intend to distinguish `ab\'cd\'` from `ab\'cd\' <100MB of other text> \'`? That is, how will you know in context whether an escaped quote is the end of the string or not? – BrenBarn Apr 26 '14 at 02:36
  • @Blckknght I would like it to be parsed the way my regex parses it. – FMc Apr 26 '14 at 02:50
  • @BrenBarn Maybe I'm overlooking something obvious ... but how is the 100MB of other text relevant? For example: when `txt = r"ab\'cd\' <100MB of other text> \'"`, my `rgx` parses the text just fine. Shouldn't pyparsing be able to do something similar with the `QuotedString` object? – FMc Apr 26 '14 at 02:54
  • @FMc: See my answer. Basically, your regex is okay if you know you're only parsing a single string. However, when you're actually trying to parse a whole document that may contain many such delimited strings, your approach can lead to confusing parses. – BrenBarn Apr 26 '14 at 03:50
  • If you have a working regex that does what you want, you can always use the pyparsing `Regex` class to make a parser element out of it. But I'll go back and look at what I did in `QuotedString` and see if your regex form is suitable for what it should parse. Off the top of my head, though, I'm disinclined to accept something that the Python parser wouldn't. – PaulMcG Apr 26 '14 at 06:20
  • @PaulMcGuire Thanks for the reply. If you have time, please see the update to the question. – FMc Apr 26 '14 at 13:19
  • Can you clarify your added example? You note that in Python it would parse to `"ab'cd\"`, but you apparently also want to handle `'ab\'cd\'`, and have it parse to the same thing, which is not allowed in Python. Also, it would be helpful if you could give some info on the larger grammar of the language apart from this string syntax. As I described in my answer, it is possible to parse these strings in isolation, but it's going to be at the least very confusing to parse them within a larger document, because of the ambiguity about where one ends and another begins. – BrenBarn Apr 27 '14 at 21:31
  • @BrenBarn Sorry for the confusion. The example string in the update expresses the problem more clearly; please ignore the initial example, which, as you and others helpfully noted, would not parse as a valid Python string. – FMc Apr 28 '14 at 01:24
  • @BrenBarn Regarding the rest of the grammar, I don't think it would help much (it's fairly simple). My main question at this point is whether pyparsing's `QuotedString` is able to parse a string like the example in the Update section above. If not, I can roll my own approach using pyparsing's `Regex` class. – FMc Apr 28 '14 at 01:27
  • note that i keep on finding this question when looking for "parsing a quoted string in python", while the answer i am looking for is simply to use shlex, see http://stackoverflow.com/questions/79968/split-a-string-by-spaces-preserving-quoted-substrings-in-python – anarcat Jun 19 '15 at 16:52

3 Answers

I think you're misunderstanding the use of escQuote. According to the docs:

escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)

So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:

>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})

Notice that the backslash was replaced with '.
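
To illustrate what an escQuote-style sequence means, here is a minimal sketch that uses only the standard `re` module rather than pyparsing itself. The pattern and the helper name `parse_sql_string` are assumptions made for illustration; this is not how QuotedString is implemented.

```python
import re

# Assumed pattern: a double-quoted string in which "" stands for one literal ".
SQL_STRING = re.compile(r'"((?:[^"]|"")*)"')

def parse_sql_string(text):
    """Return the unescaped body of the first SQL-style quoted string, or None."""
    match = SQL_STRING.search(text)
    if match is None:
        return None
    # The two-character sequence "" collapses to a single quote character.
    return match.group(1).replace('""', '"')

print(parse_sql_string('"This is "" an embedded quote char"'))
```

Here the whole two-character sequence is the escape, which is why a lone backslash is a poor fit for this parameter.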

As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:

>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'

By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.

Although I can't come up with an example at the moment, I also suspect that this could cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like

'one string \' 'or two'

we can't tell whether this is two valid strings (`one string \` and `or two`) or one with invalid material after it (`one string \' ` and the non-string tokens `or two` followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
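
By contrast, a pattern that treats a backslash as an escape in every position keeps the decision local. The sketch below uses only the standard `re` module; the pattern `LOCAL` is an assumption for illustration (it is the question's regex with the bare-backslash alternative removed), not pyparsing's internal pattern.

```python
import re

# Assumed pattern: a backslash always escapes the next character,
# so an escaped quote can never close the string.
LOCAL = re.compile(r"'(?:[^'\\\n\r]|\\.)*'")

def first_string(text):
    """Return the first quoted string found in text, or None if there is none."""
    match = LOCAL.search(text)
    return match.group(0) if match else None

print(first_string(r"'ab\'xxxxxxx"))    # no match: the only later quote is escaped
print(first_string(r"'ab\'xxxxxxx'"))   # the whole text
print(first_string(r"'ab\'cd\\'"))      # the whole text; \\ is an escaped backslash
```

With this pattern a missing closing quote yields no match at all, rather than a silently different parse.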

BrenBarn
  • Exactly right on the purpose of `escQuote`. I implemented this argument for parsing strings inside SQL, in which quotes are escaped not with backslashes, but in doubling up the quotation marks (as in `"This is "" an embedded quote char"`). – PaulMcG Apr 26 '14 at 06:29

pyparsing's QuotedString parser does not handle quoted strings that end with a single unescaped backslash. This is a fundamental limitation with no easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.

This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" is a syntax error, while r"bar\\" is accepted and keeps both backslashes in the resulting string.
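
The raw-string behaviour can be checked without typing the literals directly, by handing their source text to `ast.literal_eval` from the standard library. The variable names below are only illustrative.

```python
import ast

BACKSLASH = "\\"

# Source text of two raw-string literals, built piece by piece
# so that this example itself stays valid Python.
odd_source = 'r"foo' + BACKSLASH + '"'        # one trailing backslash
even_source = 'r"bar' + BACKSLASH * 2 + '"'   # two trailing backslashes

try:
    ast.literal_eval(odd_source)
    odd_is_valid = True
except SyntaxError:
    # The backslash keeps the final quote from closing the literal.
    odd_is_valid = False

even_value = ast.literal_eval(even_source)

print(odd_is_valid)
print(even_value.count(BACKSLASH))
```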

The reason you are getting truncated output (rather than an exception) from your current code is that you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.

Blckknght

What is it about this code that is not working for you?

from pyparsing import *

s = r"foo = 'ab\'cd\\'"  # <--- IMPORTANT - use a raw string literal here

ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue

results = strAssign.parseString(s)
print results.asList() # displays repr form of each element

for r in results:
    print r # displays str form of each element

# count the backslashes
backslash = '\\'
print results[-1].count(backslash)

prints:

['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2

EDIT:

So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of being unescaped to "\". Looks like a bug in QuotedString. For now you can add this workaround:

import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))

This takes every escaped character sequence and gives back just the escaped character, without the leading '\'.
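
The effect of that substitution can be seen on plain strings, independent of pyparsing. This is a minimal sketch using only `re`; the function name `unescape` is hypothetical.

```python
import re

def unescape(body):
    """Replace each backslash-plus-character pair with the character alone."""
    # One left-to-right pass: an escaped backslash is consumed as a pair,
    # so it cannot be mistaken for the start of another escape.
    return re.sub(r'\\(.)', r'\g<1>', body)

raw_body = r"ab\'cd\\"        # the text between the outer quotes in demo.txt
print(unescape(raw_body))     # ab'cd\
```

Because the replacement happens in a single pass, it also handles the body that still contains a doubled backslash after the quote has already been unescaped.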

I'll add this in the next patch release of pyparsing.

PaulMcG
  • I need the following result: `['foo', '=', "ab'cd\\"]`. – FMc Apr 28 '14 at 14:11
  • Thanks for looking into this, and for the work on pyparsing generally. I've used it a lot and learned a great deal from it. – FMc Apr 28 '14 at 20:52