everyone! In the book Regular Expressions Cookbook written by Jan Goyvaerts, Steven Levithan, in Chapter 3.22 there is a parser example. In short, you need to write a parser that accept two types of information:
- Keywords: "table", "row" and "cell".
- Content aka % quoted strings. Inside a string double
%%
counts as a%
symbol.
Symbols outside 1) and 2) are errors.
There are also some rules, most important are:
- keywords and content can be unseparated: expression
cell%c1%cell%d1%
is valid. - a keyword and a keyword must be separated with space symbols:
row cell
- content 1 and content 2 must be separated with space symbols:
%c1% %c2%
In Java code regexp solution is:
Pattern regex = Pattern.compile(
" \\b(table|row|cell)\\b\n" +
"| %([^%]*(?:%%[^%]*)*)%\n" +
"| (\\S+)",
Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
This solution forms 3 save groups: keywords, % quoted strings and errors. In Java and many other languages |
symbol works from left to right, so symbols not matched with left alternative matches with next and so on.
But this is not a case with PostgreSQL! In PostgreSQL documentation says: "An RE consisting of two or more branches connected by the | operator is always greedy". So it doesn't matter in which order alternatives stand, the result will be the match with maximum symbols.
So the result of match with equivalent expression in PostgreSQL might be:
select regexp_matches('cell%c1%cell%d1%',
' \y(table|row|cell)\y | %([^%]*(?:%%[^%]*)*)% | (\S+)','gix');
------------------------------
{NULL,NULL,cell%c1%cell%d1%}
(1 row)
but I expect:
regexp_matches
----------------
{cell,NULL}
{NULL,c1}
{cell,NULL}
{NULL,d1}
(4 rows)
The last expression naturally "eats" the whole string because it's largest match.
Flag g = show all matches.
Flag i = Pattern.CASE_INSENSITIVE.
Flag x = Pattern.COMMENTS.
So my question: is there any equivalence in PostgreSQL of this Java regexp?