Given the code:

$my_str = '
Rollo is*
My dog*
And he\'s very*
Lovely*
';

preg_match_all('/\S+(?=\*$)/m', $my_str, $end_words);
print_r($end_words);

In PHP 7.3.2 (XAMPP) I get the unexpected output

Array ( [0] => Array ( ) )

Whereas in PhpFiddle, on PHP 7.0.33, I get what I expected:

Array ( [0] => Array ( [0] => is [1] => dog [2] => very [3] => Lovely ) )

Why am I getting this difference? Did something change in regular expression behaviour after 7.0.33?

Peter Mortensen
Mitya
  • A useful site for testing whether something is a version difference, rather than a platform or configuration difference, is https://3v4l.org. In this case [it shows the expected output for all versions](http://3v4l.org/YfSSW), so there is some other difference in your test environments. My guess is something related to Windows vs Unix line-endings. – IMSoP Mar 12 '19 at 13:21
  • I can't reproduce your issue when testing the above code. [Here's a demo](https://3v4l.org/YfSSW). That tests 7.1.25 - 7.3.3 and gives the expected results. It even works if you check "eol versions", which tests all versions from 4.3 – M. Eriksson Mar 12 '19 at 13:23
  • Using 7.3.3 via the command line I'm seeing the same failure (empty array). – Dave Mar 12 '19 at 13:27
  • I tested through CLI on both 7.3.2 and 7.3.3 on an Ubuntu machine and it still gives me the expected result. – M. Eriksson Mar 12 '19 at 13:46
  • Interesting indeed ... I'm using Windows FWIW. PHP 7.3.3 (cli) (built: Mar 6 2019 21:53:23) ( ZTS MSVC15 (Visual C++ 2017) x64 – Dave Mar 12 '19 at 13:50
  • What have I uncovered? ;-) Not sure what to say, except it's definitely happening. – Mitya Mar 12 '19 at 14:29
  • Try `'~(*ANYCRLF)\S+(?=\*$)~'` – Wiktor Stribiżew Mar 12 '19 at 14:32
  • Not sure what that should do, @WiktorStribiżew, but that produces `array(1) { [0]=> array(1) { [0]=> string(6) "Lovely" } } ` – Mitya Mar 12 '19 at 14:37
  • Yeah, forgot `m`, `'~(*ANYCRLF)\S+(?=\*$)~m'` - it should provide consistent output across versions – Wiktor Stribiżew Mar 12 '19 at 15:17
  • Hmm, you're right, it does. But how to explain the behaviour I found? And what is `(*ANYCRLF)`? – Mitya Mar 13 '19 at 11:16
  • @WiktorStribiżew bump... – Mitya Mar 19 '19 at 12:13
  • @Utkanos It is easy: without the PCRE verb, the `$` only matches before an LF symbol. Your line endings are CRLF, so the behavior of `$` must be redefined. – Wiktor Stribiżew Mar 19 '19 at 12:16
  • @WiktorStribiżew I wouldn't call that easy or obvious, but thanks :-) – Mitya Mar 19 '19 at 12:53

1 Answer

It seems that in your environment the PCRE library was compiled without the PCRE_NEWLINE_ANY option, so in multiline mode $ only matches before an LF character and . matches any character except LF. Your line endings are CRLF, so each * is followed by a CR rather than an LF, and \*$ never matches.
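To check whether line endings are the cause, it can help to count which styles the string actually contains. The following is a minimal sketch; `describe_line_endings` is a hypothetical helper written for this illustration, not a PHP built-in, and the two sample strings are assumptions standing in for a file saved on Windows and one saved on Unix.

```php
<?php
// Hypothetical helper (not part of PHP): counts each line-ending style in a string.
function describe_line_endings(string $text): array
{
    $crlf = substr_count($text, "\r\n");
    $cr   = substr_count($text, "\r") - $crlf; // CRs that are not part of a CRLF pair
    $lf   = substr_count($text, "\n") - $crlf; // LFs that are not part of a CRLF pair
    return ['CRLF' => $crlf, 'CR' => $cr, 'LF' => $lf];
}

// Assumed sample inputs: the same text with Windows-style and Unix-style endings.
$windows_style = "Rollo is*\r\nMy dog*\r\n";
$unix_style    = "Rollo is*\nMy dog*\n";

print_r(describe_line_endings($windows_style)); // CRLF => 2, CR => 0, LF => 0
print_r(describe_line_endings($unix_style));    // CRLF => 0, CR => 0, LF => 2
```

A non-zero CRLF count for the string in the question would point to the line endings of the source file rather than to the PHP version.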

You can fix it by using the PCRE (*ANYCRLF) verb:

'~(*ANYCRLF)\S+(?=\*$)~m'

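(*ANYCRLF) specifies a newline convention in which any of (*CR), (*LF) or (*CRLF) counts as a line break; it is equivalent to the PCRE_NEWLINE_ANYCRLF option. The related PCRE_NEWLINE_ANY option, which corresponds to the (*ANY) verb, is broader. See the PCRE documentation: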

PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be recognized.

In the end, this PCRE verb makes . match any character except CR and LF, and makes $ match right before either of these two characters.
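The effect can be reproduced on any platform by writing the CRLF line endings explicitly. This is a sketch under the assumption that the string in the question was saved with Windows line endings; `(*LF)` is used here only to force the LF-only convention so that the failing case does not depend on how a particular PCRE build was compiled.

```php
<?php
// Assumption: the subject uses CRLF line endings, as a file saved on Windows typically would.
$my_str = "\r\nRollo is*\r\nMy dog*\r\nAnd he's very*\r\nLovely*\r\n";

// (*LF) forces the LF-only convention, so $ cannot match in front of the CR.
preg_match_all('~(*LF)\S+(?=\*$)~m', $my_str, $lf_only);

// (*ANYCRLF) lets $ match in front of CR, LF or CRLF.
preg_match_all('~(*ANYCRLF)\S+(?=\*$)~m', $my_str, $with_verb);

print_r($lf_only[0]);   // no matches
print_r($with_verb[0]); // is, dog, very, Lovely
```

The first call mirrors the empty array from the question, and the second returns the four expected words.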

See more about this and other verbs at rexegg.com:

By default, when PCRE is compiled, you tell it what to consider to be a line break when encountering a . (as the dot doesn't match line breaks unless in dotall mode), as well as the ^ and $ anchors' behavior in multiline mode. You can override this default with the following modifiers:

(*CR) Only a carriage return is considered to be a line break
(*LF) Only a line feed is considered to be a line break (as on Unix)
(*CRLF) Only a carriage return followed by a line feed is considered to be a line break (as on Windows)
(*ANYCRLF) Any of the above three is considered to be a line break
(*ANY) Any Unicode newline sequence is considered to be a line break
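The five conventions listed above can be compared side by side. The sketch below uses a made-up subject that contains one CR, one LF and one CRLF line ending, and records which letters are followed by `*` at a position where `$` matches in multiline mode; the subject and variable names are assumptions chosen for illustration.

```php
<?php
// One subject containing all three line-ending styles: CR, then LF, then CRLF.
$subject = "a*\rb*\nc*\r\nd*";

$results = [];
foreach (['CR', 'LF', 'CRLF', 'ANYCRLF', 'ANY'] as $verb) {
    // Match a letter that is followed by "*" at an end-of-line position.
    preg_match_all('~(*' . $verb . ')\w(?=\*$)~m', $subject, $m);
    $results[$verb] = implode(',', $m[0]);
}

print_r($results);
// CR      => a,c,d   (the CR that starts the CRLF pair also counts)
// LF      => b,d
// CRLF    => c,d
// ANYCRLF => a,b,c,d
// ANY     => a,b,c,d
```

The final `d` matches in every case because `$` also matches at the very end of the subject.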

For instance, (*CR)\w+.\w+ matches Line1\nLine2 because the dot is able to match the \n, which is not considered to be a line break. See the demo.
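The same `(*CR)` example can be tried in PHP. This is a small sketch; the `(*LF)` variant is added here only as a contrast and is not part of the quoted text.

```php
<?php
$subject = "Line1\nLine2";

// Under (*CR), "\n" is an ordinary character, so the dot can match it.
preg_match('~(*CR)\w+.\w+~', $subject, $cr);

// Under (*LF), the dot refuses "\n", so the match has to stay on the first line.
preg_match('~(*LF)\w+.\w+~', $subject, $lf);

var_dump($cr[0] === "Line1\nLine2"); // bool(true)
var_dump($lf[0] === 'Line1');        // bool(true)
```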

Peter Mortensen
Wiktor Stribiżew
  • How very strange. The only thing I can think is that, when I installed XAMPP, I deselected 'install Perl'. I know PCRE derives from Perl, so could that be what's caused this? – Mitya Mar 19 '19 at 12:53
  • @Utkanos I don't believe it had any impact. The issue is with how the PCRE library was compiled. Note that the PCRE regex library is not the same as the one used in Perl. – Wiktor Stribiżew Mar 19 '19 at 12:55