Why do these two regexes yield different results in Notepad++?

  1. //.*?\n|//.*$|\s+|. (2 matches → screenshot)
  2. //.*?(?:\n|$)|\s+|. (3 matches → screenshot)

Background

I'm writing a primitive lexer for Delphi in Perl. Its purpose is to extract words (identifiers and keywords), so it doesn't need to properly recognize all kinds of tokens.

Its core is the following regex:

\{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?$|'([^']|'')*?'|\s+|.

What I found out by chance is that line endings were not consumed by line comments. So I was curious whether I could modify the regex so that two consecutive lines consisting entirely of line comments are counted as 2 "tokens".

// first line
// last line

I replaced //.*?$ with //.*?\n, but with this regex a line comment placed directly before EOF (without a trailing newline) is not matched; instead it is broken into /, / and so on. So I searched for the right way to express the alternation. I found two regexes that behave differently in Notepad++ and grepWin but the same in Perl:

The actual difference was already shown in the introductory question:

  1. \{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?\n|//.*?$|'([^']|'')*?'|\s+|. (2 matches in above sample source)

  2. \{[^}]*\}|\(\*([^*]|\*[^\\])*?\*\)|[A-Za-z_]\w*|\d+|//.*?(?:\n|$)|'([^']|'')*?'|\s+|. (3 matches in above sample source)

It can be observed in Notepad++ (7.7.1 32-bit) and grepWin (1.9.2 64-bit). In Perl, where I place the regexes between m@( and )@mg, there are 2 matches with both.
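The end-of-file problem described above can be reproduced outside Perl. The following is a minimal sketch in Python, assuming its `re` module is an acceptable stand-in for the Perl regex in this case; the `tokens` helper and the `SAMPLE` string are hypothetical names used only for illustration, and the patterns are the shortened forms from the top of the question.

```python
import re

# LF-only sample with no newline after the last comment (assumed input)
SAMPLE = "// first line\n// last line"

def tokens(pattern, text):
    """Return every match of the tokenizer pattern, in order."""
    return [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]

# Line comment must end in "\n": the comment at EOF falls apart
newline_only = tokens(r"//.*?\n|\s+|.", SAMPLE)

# Line comment ends in "\n" or at end of line: both comments survive
newline_or_eol = tokens(r"//.*?(?:\n|$)|\s+|.", SAMPLE)

print(newline_only)
print(newline_or_eol)
```

With the first pattern the last comment is split into single-character tokens, starting with the two slashes; with the second one each comment line is one token.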

Wolf
  • Cf. https://regex101.com/r/Orqud1/1 and https://regex101.com/r/Orqud1/2: they match your sample strings the same. – Wiktor Stribiżew Aug 14 '19 at 13:19
  • @WiktorStribiżew could it be that the flags `gm` differ from `gms`? I used Notepad++ and winGrep to count matches ... Seems I have to find a sample input that shows the problem. Sorry for now. – Wolf Aug 14 '19 at 13:23
  • `//.*?\n|//.*?$` will run the `\n` portion, fail and then run the `$` portion, since it doesn't find a newline, but `//.*?(?:\n|$)` will not. That is the difference in behavior (afaik) but I can't find a difference in results. The `s` flag will let `.` match `\n` which does not change the result in this situation. – Elizabeth Aug 14 '19 at 13:25
  • As an aside if you are using the `m` flag then `$` will assert position at end of line so you shouldn't need both unless you need the `\n` to be consumed by your match. – Elizabeth Aug 14 '19 at 13:26
  • @EthanJ. this is what I seem to observe. Is there a difference in handling nested alternations? And **yes**: I want the `\n` (if there is any) to be consumed by the match. – Wolf Aug 14 '19 at 13:26
  • There's no contextual difference in how `(?:\n|$)` behaves based on the fact that it's inside another alternation, as far as I know. I'll be really stumped if you replace it with `(?:(?=\n)|$)` and still see a difference in behavior, since the thing that sets these two tokens apart is that `\n` consumes a character. Sample input in a regex101 link is the only way anyone's going to be able to help out though. – Elizabeth Aug 14 '19 at 13:34
  • @EthanJ. I see the difference in Notepad++ and winGrep, but **no difference in Perl**. I tried `(?:(?=\n)|$)` and `(?:[\n]|$)` in Notepad++ and grepWin and the difference (3 vs. 2) remains. BTW: the example above is useful. – Wolf Aug 14 '19 at 13:46
  • @Wolf my best guess is that `\s+` is matching the newline. – Elizabeth Aug 14 '19 at 13:52
  • @Wolf `s` is a singleline modifier, you should not use it with the patterns like `//.*$` as you only want to match chars other than line break chars. – Wiktor Stribiżew Aug 14 '19 at 13:56
  • @WiktorStribiżew thanks for the hint, just removed from my Perl script (and question). What remains is the problem with winGrep and Notepad++ but I seemingly just hit the border of comparability of Perl and these tools, that are nevertheless great for checking regexes. – Wolf Aug 14 '19 at 14:00
  • Why do you say the problem is with Notepad++? If you want to make `.` match newlines, just enable this option in the SR dialog. Or add `(?s)` at the start of the pattern. – Wiktor Stribiżew Aug 14 '19 at 14:01
  • @WiktorStribiżew this check (`.` matches newlines) is enabled in Notepad++. – Wolf Aug 14 '19 at 14:03
  • @WiktorStribiżew I see 3 matches with the newest Npp (32-bit) version, see [screenshot](https://i.stack.imgur.com/h3MPa.png) also linked in question. – Wolf Aug 14 '19 at 14:18
  • Ok, `//.*?\n|//.*?$` and `//.*?(?:\n|$)` show 2 matches. I suspect the `\s+` later on matches a CR, carriage return char that is before `\n`. – Wiktor Stribiżew Aug 14 '19 at 14:26
  • @WiktorStribiżew Your comment helped me to make the problem easier to reconstruct. I updated the question accordingly. Obviously the line comments themselves were not the problem. – Wolf Aug 14 '19 at 14:46
  • Since you are using `. Matches newline` may I suggest `\/\/[^\r\n]*`? https://regex101.com/r/VEc6Dl/1 – MonkeyZeus Aug 14 '19 at 14:59
  • @MonkeyZeus interesting point, `//[^\n\r]*(?:[\n\r]+|$)|\s+|.` and `//[^\n\r]*[\n\r]+|//[^\n\r]*$|\s+|.` (the slashes don't need to be escaped) are consistent in Notepad++ and winGrep. All yield 2 matches with above sample input. I don't get why this is. Maybe you can explain it to me (hopefully in an answer)? – Wolf Aug 14 '19 at 15:19
  • I'm really not enough of a regex expert to tell you why Notepad++ seems to be misbehaving with `//.*?(?:\n|$)|\s+|.` but I am confused as to why you kept the non capturing group when you tried my suggestion... – MonkeyZeus Aug 14 '19 at 15:28
  • Also, not sure if you've accounted for this but what about something like `stringVar := 'Whoops! // comment incoming!!!!';`? My primitive regex certainly fails: https://regex101.com/r/VEc6Dl/2. Even worse is if you have escaped quotes in a quoted string. – MonkeyZeus Aug 14 '19 at 15:32
  • @MonkeyZeus I'm capturing all alternatives in a first step then I see if the match is a word (identifier, reserved word) - there is still a minor omission in my question I'll add now. Thanks for your help – Wolf Aug 14 '19 at 15:32
  • Hmm well good luck to you. I'm no regex expert and never used Delphi so that's about as much help as I can give. – MonkeyZeus Aug 14 '19 at 15:35
  • @MonkeyZeus seems [I finally found out](https://stackoverflow.com/a/57554519/2932052) the cause, thanks to your hint concerning `\r\n` :) – Wolf Aug 19 '19 at 10:14
  • You're welcome, I've made a habit of using `[\r\n]*` together like that so I honestly didn't notice that you were only accounting for `\n`. However, your claim that Perl converts all line breaks into simply `\n` is a bit surprising but that's good to know in case I ever get into Perl! – MonkeyZeus Aug 19 '19 at 13:13

1 Answer

Windows Line Break Anatomy

The observed difference between Perl and the external tools is caused by the difference between \r\n and \n. If you read a text file in text mode with Perl on Windows, the newline sequence \r\n gets translated into \n, which is a single character, so \n in the pattern matches the whole line break.

In Notepad++ and grepWin, this translation is not carried out. So //.*?(?:\n|$) never consumes the newline sequence: it stops at its beginning (right between e and \r), where the regex engine matches $, and the \r remains in the input. The \s+ then matches the whole newline sequence (\r\n) as a third token.

//.*?\n, on the other hand, matches the \r with the . and then the \n, so the comment token includes the whole line break.
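The two paths can be traced with a small Python sketch. Note the assumption: Python's own `$` only matches in front of `\n`, so it does not reproduce the tools' behavior by itself. The hypothetical `EOL` lookahead below stands in for a `$` that also matches in front of `\r\n`, which is what the explanation above describes for Notepad++ and grepWin. `CRLF_SAMPLE` and `tokens` are illustrative names as well.

```python
import re

# Sample with a Windows line break between the two comments (assumed input)
CRLF_SAMPLE = "// first line\r\n// last line"

# Assumption: emulates a "$" that matches in front of "\r\n", "\n", or at EOF
EOL = r"(?=\r?\n|\Z)"

def tokens(pattern, text):
    """Return every match; DOTALL mirrors the '. matches newline' option."""
    return [m.group(0) for m in re.finditer(pattern, text, re.DOTALL)]

# Variant 1: "\n" and end-of-line are separate top-level alternatives
separate = tokens(r"//.*?\n|//.*?" + EOL + r"|\s+|.", CRLF_SAMPLE)

# Variant 2: "\n" and end-of-line share one group after the lazy ".*?"
grouped = tokens(r"//.*?(?:\n|" + EOL + r")|\s+|.", CRLF_SAMPLE)

print(separate)  # comment including "\r\n", then the last comment
print(grouped)   # comment, then "\r\n" as its own token, then the last comment
```

In the grouped variant the lazy quantifier stops as soon as the end-of-line check succeeds in front of `\r`, so the line break is left for `\s+`. In the separate variant the first alternative keeps extending until it reaches `\n`.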

If you change the newline in the pattern into \r\n for the external tools, both alternatives give two matches:

  • //.*?\r\n|//.*$|\s+|.

  • //.*?(?:\r\n|$)|\s+|.
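As a rough cross-check, the two corrected patterns can be run against a CRLF sample in Python. This is only a sketch under the assumption that Python's `re` module is close enough to the external tools for this case; `CRLF_SAMPLE` and `tokens` are hypothetical names. Because `\r\n` is tried before `$` in both patterns, the result should not depend on how the engine treats `$` in front of `\r`.

```python
import re

# Sample with a Windows line break between the two comments (assumed input)
CRLF_SAMPLE = "// first line\r\n// last line"

# MULTILINE for "$" at line ends, DOTALL for the ". matches newline" option
FLAGS = re.MULTILINE | re.DOTALL

def tokens(pattern, text):
    """Return every match of the tokenizer pattern, in order."""
    return [m.group(0) for m in re.finditer(pattern, text, FLAGS)]

fixed_separate = tokens(r"//.*?\r\n|//.*$|\s+|.", CRLF_SAMPLE)
fixed_grouped = tokens(r"//.*?(?:\r\n|$)|\s+|.", CRLF_SAMPLE)

print(len(fixed_separate), len(fixed_grouped))
```

Both patterns produce the same two tokens here, each comment line being one match.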

Wolf