Can regexes containing nongreedy (reluctant) quantifiers be rewritten to use only greedy ones?

Question

Suppose I have a regex language supporting literals, positive and negative character classes, ordered alternation, and the greedy quantifiers ?, *, and +. (This is essentially a subset of PCRE without backreferences, look-around assertions, or some of the other fancier bits.) Does adding the nongreedy quantifiers ??, *?, and +? increase the expressive power of this formalism?

Put another way, given a pattern S containing a nongreedy quantifier, can that pattern be rewritten to an equivalent pattern T which contains no nongreedy quantifiers?

If this question has been considered in the literature, I'd appreciate any references which anyone can provide. I was able to turn up almost no theoretical work on the expressive power of extended regex formalisms (beyond the usual things about how backreferences move you from regular languages to context-free grammars).

Your question could be rewritten as "Does such a pattern exist that contains greedy quantifiers that cannot be transformed into a pattern without greedy quantifiers?" Or the assertion "no pattern exists that ..." Taken this way, you're looking for a counter-example. Your question might get further, faster by providing example greedy patterns, asserting that they are not transformable to meet your criteria, and seeing what people come up with. Then again, people might think they can answer your full hypothesis by solving each test... — Merlyn Morgan-Graham, Jul 20 '11 at 18:40
Wouldn't this type of question be more appropriate for http://cstheory.stackexchange.com/ — Mat, Jul 20 '11 at 18:48
@Mat: Yes, I'd forgotten that the cstheory site existed. I've [asked](http://cstheory.stackexchange.com/questions/7464/can-regexes-containing-nongreedy-reluctant-quantifiers-be-rewritten-not-to-use) there. — uckelman, Jul 21 '11 at 12:14

score 2 · Answer 1 · answered Jul 20 '11 at 19:28

When you say "Regex", you're refering to a several techniques - this isn't just an issue of the underlying theory. Consider the question "does this string match my given Regex?" For such a question, the notion of "greedy" is merely an implementation detail - if you're using one of the common (but inefficient) backtracking implementations, this can affect performance but not output. Similarly, the question "does this string contain a match?" is not affected by greedy vs. non-greedy quantifiers. This first type of regex is concerning with the abstract notion of set-membership: defining the language of matching strings.

So why do non-greedy quantifiers even exist? Regexes are not merely used for matching; common implementations can locate where the match is and which parts of the regex match which parts of the output. By doing this, a user depends on the intricacies of the implementation, which isn't trivial. This second type of regex is concerned with getting a few bits of text into a more practical representation withing the context of an otherwise turing-complete language.

Generally, when you talk of the strength of the regular expression formalism, you're talking about the first world - in which the computer answers with a simple yes or no. It's easy to talk about because the specification is clear. When you talk of a greedy vs. non-greedy quantifier, you're talking about the second world - used lots in practice, but with a specification that's mostly grown without much planning to solve real problems and is a standard by virtue of backwards compatibility. This second world is solving entirely different problems, and it's not clear to me what "expressive power" should even mean here. Certainly, non-greedy can be practical; and that's kind of the point...

Non-greedy quantifiers don't do anything for the first type of expressiveness, and they do for the second, although it's not quite clear what "expressive power" means here anyhow.

This *is* an issue of theory, not of implementation. Example: The languages of the patterns `a+` and `a+?` are not the same. L(`a+`) is the set of all finite runs of the letter `a`, while L(`a+?`) = { `a` }. — uckelman, Jul 21 '11 at 09:20
`a+?` matches `aaa` just fine - the non-greedy part simply means it'll match `a` first (again, this difference is about searching, not matching). You can demonstrate that behavior in a typical implementation by searching for `^a+?$` - which will return all a's - and clearly any consumed `a` in the input must then have been matched by the `a+?` sub-matcher. — Eamon Nerbonne, Jul 21 '11 at 17:17
The reason I'm disagreeing here is that it's no longer the same pattern once you add the anchors. — uckelman, Jul 22 '11 at 10:33
@uckelman: Can you clarify what you mean when you write that L(`a+?`) = { `a` }? Taken literally, that's obviously not true, since `a+?` will happily match any sequence of one or more `a`'s -- it just *prefers* to match as few as possible. — ruakh, May 19 '14 at 00:28
I should revise: I find that the usual definitions of match and the language of a pattern are infelicitous when dealing with nongreedy quantification. The equivalence I'm interested in is *search* equivalence, not language equivalence. `a+?` will never match more than one `a` in a search, even though `L(a+?)` contains `a`. — uckelman, May 21 '14 at 10:50

Can regexes containing nongreedy (reluctant) quantifiers be rewritten to use only greedy ones?

1 Answers1