
This is driving me insane...

I have the following code:

    # open pdf
    $pdf = file_get_contents('myfile.pdf');

    echo("RE 1:\n");
    preg_match('/^[0-9]+ 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
    var_dump($m);

    echo("\nRE 2:\n");
    preg_match('/^8 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \\]/msU', $pdf, $m);
    var_dump($m);

The file myfile.pdf contains the following text:

    ...
    8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]
    >>
    endobj
    ...

The only difference between those two regular expressions is the object number at the start: RE 1 matches any number with [0-9]+, while RE 2 hard-codes 8. Yet I get the following output:

    RE 1:
    array(0) {
    }

    RE 2:
    array(2) {
      [0]=>
      string(78) "8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]"
      [1]=>
      string(3) "5 0"
    }

I would expect both regular expressions to return similar results, but the one with [0-9]+ at the start (RE 1) doesn't return any results. Is this a bug or am I doing something wrong?

Update

After adding preg_last_error(), I am getting PREG_BACKTRACK_LIMIT_ERROR. How can I fix that?
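
For anyone reproducing this, here is a minimal sketch of how the error code can be surfaced. The helper name pregErrorName and the short sample string are made up for illustration; preg_last_error() and the PREG_*_ERROR constants are standard PHP. The sample is small, so it matches instead of failing; against a large file the same check is what reports the backtrack error.

```php
<?php
// Hypothetical helper (not from the original post): map a preg_last_error()
// code to the name of its constant.
function pregErrorName(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? "unknown error ($code)";
}

// Small stand-in for the PDF contents (assumption: the real file is much larger).
$sample = "8 0 obj\n<<\n/Contents [ 5 0 R ]\n>>\nendobj\n";
$result = preg_match('/^[0-9]+ 0 obj.*\/Contents \[ ([0-9]+ [0-9]+) R \]/msU', $sample, $m);

// preg_match() returns false (not 0) when the engine aborts with an error.
if ($result === false) {
    echo 'preg_match failed: ', pregErrorName(preg_last_error()), "\n";
} else {
    echo "matched: $result, capture: ", $m[1] ?? '(none)', "\n";
}
```

The limit being hit is the pcre.backtrack_limit ini setting. Raising it with ini_set() is one way out, though it only moves the ceiling.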

Ryan Steffer

1 Answer

My guess is that you want an expression along these lines:

    [0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]

with the s (dotall) modifier, so that . also matches newlines.

Test

    $re = '/[0-9]+\s+0\s+obj\b.*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]/s';
    $str = '8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]
    >>
    endobj

    8 0 obj
    <<
    /Type /Page
    /Parent 2 0 R
    /Resources 6 0 R
    /Contents [ 5 0 R ]
    >>
    endobj';

    preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

    var_dump($matches);

The expression is explained in the top-right panel of regex101.com, where you can also explore, simplify, or modify it and watch how it matches against sample inputs.
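
Since the question's update mentions PREG_BACKTRACK_LIMIT_ERROR, one option that may be worth trying is to stop the lazy scan at endobj, so each attempt stays inside a single object instead of running through the rest of the file. This swaps in a different technique, a tempered dot written as (?:(?!endobj).)*?, and is not part of the expression above. It assumes the literal text endobj never appears inside an object's data before its real end. The objects numbered 1 and 9 in the sample are invented for illustration.

```php
<?php
// Sample: object 8 is from the question; objects 1 and 9 are made up.
$pdf = "1 0 obj\n<<\n/Type /Catalog\n>>\nendobj\n"
     . "8 0 obj\n<<\n/Type /Page\n/Parent 2 0 R\n/Resources 6 0 R\n/Contents [ 5 0 R ]\n>>\nendobj\n"
     . "9 0 obj\n<<\n/Type /Page\n/Contents [ 7 0 R ]\n>>\nendobj\n";

// (?:(?!endobj).)*? consumes one character at a time, lazily, but refuses to
// step onto the start of "endobj". An object without /Contents therefore fails
// quickly at its own endobj instead of scanning onward into later objects.
$re = '/^[0-9]+\s+0\s+obj\b(?:(?!endobj).)*?\/Contents\s+\[\s*([0-9]+\s+[0-9]+)\s+R\s*\]/ms';

$count = preg_match_all($re, $pdf, $matches, PREG_SET_ORDER);

if ($count === false) {
    echo 'preg error code: ', preg_last_error(), "\n";
} else {
    foreach ($matches as $match) {
        // $match[1] holds the "object generation" pair inside /Contents [ ... R ]
        echo $match[1], "\n";
    }
}
```

On this sample the loop prints 5 0 and 7 0, and object 1 is skipped because it has no /Contents entry. Whether this is enough to stay under the limit on a real PDF depends on how large the individual objects are.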


Emma
  • OP is using `/msU` so their `.` matches everything including newlines. – MonkeyZeus Jul 29 '19 at 18:45
  • Yes, but that RE does work. And my output of preg_last_error() is PREG_BACKTRACK_LIMIT_ERROR. So that's why mine doesn't work I guess. But I'm not sure what causes that... – Ryan Steffer Jul 29 '19 at 18:54
  • Yours works with /msU. I'm wondering if it's the word boundary you used... Testing more things now. – Ryan Steffer Jul 29 '19 at 18:58
  • Ok, yours works because you added the .*? quantifier. I am already using the /U modifier, which means ungreedy. But then your .*? reverses that. But I need it to be ungreedy, as there are many instances of this string I'm trying to capture. – Ryan Steffer Jul 29 '19 at 19:14
  • I see that. It will fail in regex101.com. But both are acceptable in PHP. I have tried both and they yield the exact same result. – Ryan Steffer Jul 29 '19 at 19:26
  • Yes, confusing to say the least. But PHP seems to interpret it either way. I guess because a \\ could be a literal \, but when it's before a ], it might assume you're trying to escape the ], so it just rolls it together. I don't know... – Ryan Steffer Jul 29 '19 at 19:32
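
Two points from the comments can be checked with a short script: how /U interacts with a trailing ?, and why \\] and \] behave the same in PHP. This is only a sketch. The subject string axbxb is invented for illustration, and the second part relies on PHP's single-quoted string rules, where only \\ and \' are escape sequences.

```php
<?php
$subject = 'axbxb';

// Without /U: .* is greedy and .*? is lazy.
preg_match('/a.*b/', $subject, $greedy);    // longest match
preg_match('/a.*?b/', $subject, $lazy);     // shortest match

// With /U the defaults are swapped: .* becomes lazy and .*? becomes greedy.
preg_match('/a.*b/U', $subject, $uLazy);    // shortest match
preg_match('/a.*?b/U', $subject, $uGreedy); // longest match

echo $greedy[0], ' ', $lazy[0], ' ', $uLazy[0], ' ', $uGreedy[0], "\n";

// In a single-quoted PHP string, '\\]' and '\]' are the same two characters
// (a backslash followed by a bracket), so PCRE receives an identical pattern.
$doubled = '\\]';
$single  = '\]';
var_dump($doubled === $single, strlen($doubled));
```

The first line printed is axbxb axb axb axbxb, and the var_dump shows bool(true) and int(2). So the difference seen on regex101.com comes from pasting the raw pattern there, where no PHP string unescaping happens first.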