3

A bit of background: I'm implementing a regex matching engine (NFA) and it should support PCRE compatibility mode (I mean it should capture subexpressions with the same offsets as PCRE would do).

There's a test in PCRE's testinput1 which I can't fully understand. It tests lazy quantifiers.

So, the regex is

/<a[\s]+href[\s]*=[\s]*          # find <a href=
 ([\"\'])?                       # find single or double quote
 (?(1) (.*?)\1 | ([^\s]+))       # if quote found, match up to next matching
                                 # quote, otherwise match up to next space
/isx

And the string is

<a href="abcd xyz pqr" cats

PCRE's match is:

<a href="abcd xyz pqr"

and it is obviously using the lazy quantifier.

As far as I understand, lazy quantifiers should not be used until another "greedy" ways are impossible at all. Now here's a possible greedy match:

<a href="abcd

which uses the negative branch of the conditional subpattern, no lazy quantifiers.

So I'm looking for an explanation of this PCRE's behaviour or any details/suggestions why the lazy quantifier matches in this test. Thanks!

EDIT: I also checked out how the TRE library works. It's a POSIX-compatible NFA engine. I modified the original regex a little bit to suit TRE's syntax:

#include <stdlib.h>
#include <stdio.h>
#include <tre/tre.h>

int main()
{
    regex_t preg;
    const char * regex = "<a[ ]+href[ ]*=[ ]*(?:(')(.*?)'|[^ ]+)";
    const char * string = "<a href='abcd xyz pqr' cats";
    int cflags = REG_EXTENDED;
    int eflags = 0;
    size_t nmatch = 3;
    regmatch_t pmatch[100];

    tre_regcomp(&preg, regex, cflags);
    tre_regexec(&preg, string, nmatch, pmatch, eflags);

    for (int i = 0; i < nmatch; i++) {
        printf("%d: (%d, %d)\n", i, pmatch[i].rm_so, pmatch[i].rm_eo - pmatch[i].rm_so);
    }

    return 0;
}

and the output (using lengths instead of end offsets) is:

0: (0, 22)
1: (8, 1)
2: (9, 12)

So the suggestion about PCRE's backtracking-specific behaviour is most likely wrong...

  • I once made the mistake of having a look at how it's done. **SCARY!** As far as I can see, full PCRE requires a stack evaluation model and not an automata-theoretic one. The closest you can get with automata-based approaches is likely the RE engine used in PostgreSQL and Tcl, but that's really really complicated. – Donal Fellows Jan 16 '14 at 11:30
  • Thanks for the quick response. But it seems not to be a backtracking specific behaviour (see edit) and can be implemented using automata (although I know that TRE fails some tests for POSIX compliance from AT&T Test Data Repository and, thus, this behaviour may be kind of a "bug"). – Valeriy Streltsov Jan 16 '14 at 12:52

2 Answers2

1

First of all I am just a beginner in the REGEX world. So, I am sorry if this answer is wrong or if I misunderstood the question.

Reading this definition taken from this book Regular Expressions Cookbook:

(?(1)then|else) is a conditional that checks whether the first capturing group has already matched something. If it has, the regex engine attempts to match then. If the capturing group has not participated in the match attempt thus far, the else part is attempted.

  • With this subject: <a href="abcd xyz pqr" cats

    The first capturing group has matched the first " character. So, the expected behaviour is to attempt to match the then part. The second capturing group in the then part manages to match the string abcd xyz pqr with (.*?) and finally the then part manages to match abcd xyz pqr" with (.*?)\1. The REGEX may finish with success.

    So, the else part with its greddy quantifier is not required, indeed it is not used. It is as if the greddy quantifier never existed.

  • With this subject: <a href="abcd

    The first capturing group has matched the " character. Now the then part manages to match the string abcd with (.*?) but it will never match the last " character because there is not such character at the end of the subject. The conditional fails.

    The REGEX engine does not stop here, you have used ([\"\'])? so, the engine may try again because the " character is optional and it keeps going on as if the first capturing group had not matched (indeed there is backtrack). So, now the engine reachs the conditional with no match for the first capturing group, the else part is attempted and it manages to match the string "abcd (the " character is not matched by the first capturing group because of the backtrack and now it is matched by the third capturing group in the else part) The REGEX may finish with success.

PS: I am learning for fun stuff about regular expressions, so probably this answer is totally wrong. Wait for a better answer.

Cœur
  • 37,241
  • 25
  • 195
  • 267
Gooseman
  • 2,102
  • 17
  • 16
0

I don't fully understand your question here, but the non-greedy quantifier allows to search to first occurrence of the pattern. With pcretest, you try the greedy and non-greedy forms as well on the same input.

Non-greedy form:

  re> /<a[\s]+href[\s]*=[\s]*([\"\'])?(?(1)(.*?)\1|([^\s]+))/i
  data> <a href="ab"cd"
    0: <a href="ab"
    1: "
    2: ab

Greedy form:

 re> /<a[\s]+href[\s]*=[\s]*([\"\'])?(?(1)(.*)\1|([^\s]+))/i
 data> <a href="ab"cd"
    0: <a href="ab"cd"
    1: "
    2: ab"cd
dark100
  • 87
  • 2