PHP RegEx not matching a string that it should match

Question

This is driving me insane...

I have the following code:

    # open pdf
    $pdf = file_get_contents('myfile.pdf');

    echo("RE 1:\n");
    preg_match('/^[0-9]+ 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
    var_dump($m);

    echo("\nRE 2:\n");
    preg_match('/^8 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
    var_dump($m);

The file myfile.pdf contains the following text:

    ...
    8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]
    >>
    endobj
    ...

The only difference between the two regular expressions is the start of the pattern: RE 1 uses the character class `[0-9]+` where RE 2 uses the literal `8`. Yet I get the following output:

    RE 1:
    array(0) {
    }

    RE 2:
    array(2) {
      [0]=>
      string(78) "8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]"
      [1]=>
      string(3) "5 0"
    }

I would expect both regular expressions to return similar results, but the regular expression with the numeric range at the start (RE 1) doesn't return any results. Is this a bug or am I doing something wrong?

Update

After adding preg_last_error(), I am getting PREG_BACKTRACK_LIMIT_ERROR. How can I fix that?

@Emma Yes, that is what I'm trying to capture. It works perfectly on regex101.com, but not in my code. — Ryan Steffer, Jul 29 '19 at 18:39
Both of your regexes work fine at http://sandbox.onlinephpfunctions.com/ so it could be that your PHP or PCRE version is causing a headache? — MonkeyZeus, Jul 29 '19 at 18:42
Try using [`preg_last_error()`](https://www.php.net/manual/en/function.preg-last-error.php) to see if it gives you any hints. — MonkeyZeus, Jul 29 '19 at 18:44
@MonkeyZeus Good call! I am getting PREG_BACKTRACK_LIMIT_ERROR. — Ryan Steffer, Jul 29 '19 at 18:52
Check your `php.ini` file and see what [`pcre.backtrack_limit`](https://www.php.net/manual/en/pcre.configuration.php#ini.pcre.backtrack-limit) is set to or use `echo ini_get( 'pcre.backtrack_limit' );` if you don't have access to `php.ini` — MonkeyZeus, Jul 29 '19 at 18:54
It's commented out: ;pcre.backtrack_limit=100000. So I assume it's using the default value of 100000. That seems quite high though, no? — Ryan Steffer, Jul 29 '19 at 18:57
One would hope, yes. [This answer](https://stackoverflow.com/a/9692029/2191572) alludes to an issue with brace nesting so maybe try `\]` instead of `\\]`? — MonkeyZeus, Jul 29 '19 at 18:58
@Thefourthbird I am trying to capture the Contents string, but it's coming up with no matches. — Ryan Steffer, Jul 29 '19 at 22:19
It is in the first capturing group. Using preg_match_all for example https://3v4l.org/ddmRY Only use the `/m` flag — The fourth bird, Jul 29 '19 at 22:22
@Thefourthbird This is my script and output: https://pastebin.com/PrHD5vgv — Ryan Steffer, Jul 29 '19 at 22:42
I used the example data from the question; perhaps the data from the PDF differs a bit. Is it always structured like that? Can you share some more data from the PDF? — The fourth bird, Jul 29 '19 at 22:45
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/197185/discussion-between-ryan-steffer-and-the-fourth-bird). — Ryan Steffer, Jul 29 '19 at 22:50
Perhaps a bit less restrictive https://regex101.com/r/5NdmS6/1 — The fourth bird, Jul 29 '19 at 23:06
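
The diagnostic steps suggested in the comments above can be sketched as follows. This is a minimal illustration, not code from the thread: `match_or_explain` is a hypothetical helper name, and the sample string only mimics the object shown in the question. It relies on `preg_match()` returning `false` on failure, at which point `preg_last_error()` and `ini_get('pcre.backtrack_limit')` show what went wrong.

```php
<?php
// Hypothetical helper (not from the original post): run preg_match() and
// report why it failed instead of silently leaving an empty match array.
function match_or_explain(string $pattern, string $subject): array
{
    // Readable names for the codes preg_last_error() can return.
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];

    $result = preg_match($pattern, $subject, $m);

    // false means the engine gave up; 0 means it ran fine but found nothing.
    if ($result === false) {
        $code = preg_last_error();
        throw new RuntimeException(sprintf(
            '%s (pcre.backtrack_limit=%s)',
            $names[$code] ?? "error code $code",
            ini_get('pcre.backtrack_limit')
        ));
    }

    return $m;
}

// Made-up sample in the shape of the object from the question.
$sample = "8 0 obj\n<<\n/Type /Page\n/Contents [ 5 0 R ]\n>>\nendobj\n";
$m = match_or_explain('/^[0-9]+ 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \]/msU', $sample);
var_dump($m[1]);
```

If the limit does turn out to be the cause, `ini_set('pcre.backtrack_limit', ...)` can raise it at run time, although a pattern that scans less text is usually the more robust fix.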

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

I'm guessing you are after an expression that looks something like this:

    [0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]

in `s` (dot-matches-newline) mode.

Test

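    // \s+ tolerates any whitespace between tokens; .*? stops at the first /Contents
    $re = '/[0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]/s';
    $str = '8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]
    >>
    endobj

    8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]
    >>
    endobj';

    // PREG_SET_ORDER groups the results per match: $matches[i][1] is the captured reference
    preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

    var_dump($matches);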

The expression is explained in the top-right panel of regex101.com, should you wish to explore, simplify, or modify it; there you can also watch how it matches against sample inputs.
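
One possible way to keep the lazy scan from running across the whole file is to forbid it from stepping over `endobj`, so each attempt is bounded by a single object. The sketch below is an illustration under assumptions, not the expression from this answer: it assumes the literal text `endobj` never occurs inside an object's body (binary streams could violate this), and the sample data is made up in the shape shown in the question.

```php
<?php
// Made-up sample: two objects, only the second one has a /Contents entry.
$pdf = "7 0 obj\n<<\n/Type /Font\n>>\nendobj\n"
     . "8 0 obj\n<<\n/Type /Page\n/Parent 2 0 R\n/Resources 6 0 R\n"
     . "/Contents [ 5 0 R ]\n>>\nendobj\n";

// (?:(?!endobj).)*? consumes characters lazily but refuses to move past
// the start of "endobj", so a match can never span two objects.
$re = '/^([0-9]+) 0 obj\b(?:(?!endobj).)*?\/Contents \[ ([0-9]+ [0-9]+) R \]/ms';

$count = preg_match_all($re, $pdf, $matches, PREG_SET_ORDER);

if ($count === false) {
    // The engine gave up (for example, on a backtrack limit); report the code.
    echo 'preg error code: ', preg_last_error(), "\n";
} else {
    foreach ($matches as $m) {
        // $m[1] is the object number, $m[2] the referenced contents object.
        printf("object %s -> contents %s\n", $m[1], $m[2]);
    }
}
```

Because the pattern uses no `/U` modifier, `*?` keeps its usual lazy meaning here.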

answered Jul 29 '19 at 18:40

Emma

OP is using `/msU` so their `.` matches everything including newlines. – MonkeyZeus Jul 29 '19 at 18:45

Yes, but that RE does work. And my output of preg_last_error() is PREG_BACKTRACK_LIMIT_ERROR. So that's why mine doesn't work I guess. But I'm not sure what causes that... – Ryan Steffer Jul 29 '19 at 18:54

Yours works with /msU. I'm wondering if it's the word boundary you used... Testing more things now. – Ryan Steffer Jul 29 '19 at 18:58

Ok, yours works because you added the .*? quantifier. I am already using the /U modifier, which means _Ungreedy_. But then your .*? reverses that. But I need it to be ungreedy, as there are many instances of this string I'm trying to capture. – Ryan Steffer Jul 29 '19 at 19:14

I see that. It will fail in regex101.com. But both are acceptable in PHP. I have tried both and they yield the exact same result. – Ryan Steffer Jul 29 '19 at 19:26

Yes, confusing to say the least. But PHP seems to interpret it either way. I guess because a \\ could be a literal \, but when it's before a ], it might assume you're trying to escape the ], so it just rolls it together. I don't know... – Ryan Steffer Jul 29 '19 at 19:32
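
Two points from the comments above can be checked directly. The snippet below is a small sketch with a made-up test string: it shows that `'\\]'` and `'\]'` are the same two characters in a single-quoted PHP string, and that the `/U` modifier inverts greediness, so `.*?` under `/U` becomes greedy.

```php
<?php
// 1. In a single-quoted string '\\' collapses to one backslash, while '\]'
//    is not a recognized escape and is kept as-is. Both yield: \]
var_dump('\\]' === '\]');

// 2. /U swaps the meaning of "*" and "*?".
$s = 'a1b2b';
preg_match('/a.*b/',   $s, $greedy);   // default: greedy
preg_match('/a.*b/U',  $s, $lazyU);    // /U makes the plain quantifier lazy
preg_match('/a.*?b/U', $s, $flipped);  // /U plus "?" flips it back to greedy
preg_match('/a.*?b/',  $s, $lazy);     // default with "?": lazy

var_dump($greedy[0], $lazyU[0], $flipped[0], $lazy[0]);
```

So when copying a pattern that already contains `.*?` into code that uses `/U`, one of the two should be dropped to keep the match lazy.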
