A bit of background: I'm implementing an NFA-based regex matching engine that should support a PCRE compatibility mode, i.e. it should capture subexpressions at the same offsets PCRE does.
There's a test in PCRE's testinput1 which I can't fully understand. It tests lazy quantifiers.
So, the regex is
/<a[\s]+href[\s]*=[\s]* # find <a href=
([\"\'])? # find single or double quote
(?(1) (.*?)\1 | ([^\s]+)) # if quote found, match up to next matching
# quote, otherwise match up to next space
/isx
And the string is
<a href="abcd xyz pqr" cats
PCRE's match is:
<a href="abcd xyz pqr"
and it is obviously using the lazy quantifier.
As far as I understand, a lazy quantifier should only come into play when no "greedy" way of matching is possible. But here's a possible greedy match:
<a href="abcd
which uses the "no" branch of the conditional subpattern and no lazy quantifiers: group 1 is skipped, so [^\s]+ consumes "abcd up to the first space.
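To make the two candidate matches above concrete, here's a minimal hand-written sketch that computes where each one ends. This is not PCRE code: the helper names match_prefix, quoted_end and unquoted_end are made up for illustration, it assumes lowercase ASCII input (so the /i flag is ignored), and it makes no claim about which candidate an engine should try first.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: match  <a\s+href\s*=\s*  at the start of s.
   Returns the offset just past the prefix, or -1 on failure. */
static int match_prefix(const char *s)
{
    int i = 2;
    if (strncmp(s, "<a", 2) != 0 || !isspace((unsigned char)s[i]))
        return -1;
    while (isspace((unsigned char)s[i])) i++;   /* [\s]+ */
    if (strncmp(s + i, "href", 4) != 0)
        return -1;
    i += 4;
    while (isspace((unsigned char)s[i])) i++;   /* [\s]* */
    if (s[i] != '=')
        return -1;
    i++;
    while (isspace((unsigned char)s[i])) i++;   /* [\s]* */
    return i;
}

/* Candidate 1: group 1 takes the quote, then the "yes" branch (.*?)\1
   stops at the first later occurrence of the same quote character.
   Returns the end offset of the whole match, or -1 if this path fails. */
static int quoted_end(const char *s, int pos)
{
    char q = s[pos];
    const char *close;
    if (q != '"' && q != '\'')
        return -1;
    close = strchr(s + pos + 1, q);
    return close ? (int)(close - s) + 1 : -1;
}

/* Candidate 2: group 1 is skipped, then the "no" branch ([^\s]+)
   takes the longest run of non-whitespace characters.
   Returns the end offset of the whole match, or -1 if the run is empty. */
static int unquoted_end(const char *s, int pos)
{
    int i = pos;
    while (s[i] != '\0' && !isspace((unsigned char)s[i])) i++;
    return i > pos ? i : -1;
}
```

On the test string, match_prefix returns 8, quoted_end returns 22 (the match PCRE reports) and unquoted_end returns 13 (the shorter match I expected).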
So I'm looking for an explanation of PCRE's behaviour here, or any details or suggestions as to why the lazy quantifier is used in this test. Thanks!
EDIT: I also checked how the TRE library behaves. It's a POSIX-compatible NFA engine. I modified the original regex a little to suit TRE's syntax (the conditional became a plain alternation, and the test uses single quotes instead of double quotes):
#include <stdlib.h>
#include <stdio.h>
#include <tre/tre.h>

int main(void)
{
    regex_t preg;
    const char *regex = "<a[ ]+href[ ]*=[ ]*(?:(')(.*?)'|[^ ]+)";
    const char *string = "<a href='abcd xyz pqr' cats";
    int cflags = REG_EXTENDED;
    int eflags = 0;
    size_t nmatch = 3;
    regmatch_t pmatch[100];

    /* Bail out if the pattern does not compile or the string does not match. */
    if (tre_regcomp(&preg, regex, cflags) != 0)
        return EXIT_FAILURE;
    if (tre_regexec(&preg, string, nmatch, pmatch, eflags) != 0) {
        tre_regfree(&preg);
        return EXIT_FAILURE;
    }
    /* Print each group as (start offset, length). */
    for (size_t i = 0; i < nmatch; i++) {
        printf("%zu: (%d, %d)\n", i, (int)pmatch[i].rm_so,
               (int)(pmatch[i].rm_eo - pmatch[i].rm_so));
    }
    tre_regfree(&preg);
    return 0;
}
and the output (using lengths instead of end offsets) is:
0: (0, 22)
1: (8, 1)
2: (9, 12)
So the suggestion that this behaviour is specific to PCRE's backtracking is most likely wrong, since TRE reports the same match...