Why do these two regexes yield different results in Notepad++?
//.*?\n|//.*$|\s+|.
(2 matches → screenshot)
//.*?(?:\n|$)|\s+|.
(3 matches → screenshot)
Background
I'm writing a primitive lexer for Delphi in Perl. Its purpose is to extract words (identifiers and keywords), so it doesn't need to recognize every kind of token properly.
Its core is the following regex:
\{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?$|'([^']|'')*?'|\s+|.
What I found out by chance is that line endings were not consumed by line comments. So I was curious whether I could modify the regex so that two consecutive lines consisting entirely of line comments are counted as 2 "tokens":
// first line
// last line
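For illustration, here is a minimal sketch of that starting point. It uses only the comment, whitespace and catch-all alternatives of the regex (not the full lexer) and assumes the sample uses LF line endings with no newline at EOF; the variable names are made up for this example.

```perl
use strict;
use warnings;

# Sample source: two line comments, LF line ending, no newline at EOF (assumption).
my $source = "// first line\n// last line";

# Simplified lexer: line comment, whitespace run, or any single character.
# With /m, $ matches before the newline, so the comment stops short of it.
my @tokens = $source =~ m@(//.*?$|\s+|.)@mg;

printf "%d tokens\n", scalar @tokens;
for my $token (@tokens) {
    (my $shown = $token) =~ s/\n/\\n/g;   # make the newline visible
    print "[$shown]\n";
}
```

Here the newline between the two comments ends up as a whitespace token of its own, so the sample yields 3 tokens rather than 2.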
I replaced //.*?$ by //.*?\n, but with this regex a line comment placed directly before EOF (without a trailing newline) is not matched; instead it is broken into /, / and so on. So I searched for the right way to express the alternation. I found two regexes that behave differently in Notepad++ and grepWin but the same in Perl.
The actual difference was already shown in the introductory question:
\{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?\n|//.*?$|'([^']|'')*?'|\s+|.
(2 matches in above sample source)
\{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?(?:\n|$)|'([^']|'')*?'|\s+|.
(3 matches in above sample source)
It can be observed in Notepad++ (7.7.1 32-bit) and grepWin (1.9.2 64-bit). In Perl, where I place the regexes between m@( and )@mg, there are 2 matches with both.
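The Perl side of the comparison can be reproduced with a sketch like the following. It uses the short regexes from the top of the question plus the newline-only variant, and assumes the sample has LF line endings and no newline at EOF; the labels and variable names exist only for this illustration.

```perl
use strict;
use warnings;

# Sample source with LF line endings and no newline at EOF (assumption).
my $source = "// first line\n// last line";

# Labels are only for this illustration.
my @candidates = (
    [ 'newline only'          => qr@(//.*?\n|\s+|.)@m ],
    [ 'separate alternatives' => qr@(//.*?\n|//.*$|\s+|.)@m ],
    [ 'grouped alternation'   => qr@(//.*?(?:\n|$)|\s+|.)@m ],
);

my %count;
for my $candidate (@candidates) {
    my ($label, $re) = @$candidate;
    my @tokens = $source =~ /$re/g;   # list context: one captured token per match
    $count{$label} = scalar @tokens;
    printf "%-22s %d matches\n", $label, $count{$label};
}
```

The newline-only variant splits the last comment into single characters, while the other two each report 2 matches in Perl, which is the behaviour I cannot reproduce in Notepad++ and grepWin.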