Here is my code:

echo "<br />";
preg_match_all("|<[^>]+>.*</[^>]+>|U",
    "<b>example:</b><strong>this is a test</strong>",
    $out, PREG_PATTERN_ORDER);
print_r($out);
echo "<br />";

echo "<br />";
preg_match_all("|<[^>]+>.*</[^>]+>|",
    "<b>example:</b><strong>this is a test</strong>",
    $out, PREG_PATTERN_ORDER);
print_r($out);
echo "<br />";

There is something I do not understand. What difference does it make when there is a U at the end of the regex?

The output is:

Array ( [0] => Array ( [0] => example: [1] => this is a test ) )

Array ( [0] => Array ( [0] => example:this is a test ) )

So what is happening here really? Which version is the greedy version and why?

Koray Tugay
  • Also, how can I achieve the same results using the ? modifier? – Koray Tugay Jan 23 '13 at 17:41
  • **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php.html for examples of how to properly parse HTML with PHP modules. – Andy Lester Jan 23 '13 at 17:49
  • @AndyLester Thanks, I am just trying to learn really.. Thanks though. – Koray Tugay Jan 23 '13 at 17:51

1 Answer

The U modifier (PCRE_UNGREEDY) makes the quantifiers in your regular expression "ungreedy" by default. A greedy quantifier tries to match as much as possible, whereas an ungreedy one matches as little as possible.

So in the greedy example your match is:

<b>example:</b><strong>this is a test</strong>

The HTML tags "</b><strong>" are not stripped by preg_match_all. They are part of the match and are missing from your output only because the browser renders them as markup when you print the result.

In contrast, the ungreedy version does what you want and produces two separate matches:

<b>example:</b>, <strong>this is a test</strong>
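
To see the raw matches with their tags intact, a minimal sketch along these lines may help. It assumes the output is viewed in a browser, which is why it escapes the result with htmlspecialchars; the variable names $subject, $greedy and $ungreedy are only illustrative.

```php
<?php
$subject = "<b>example:</b><strong>this is a test</strong>";

// Greedy (no U): .* takes as much as possible, so one match spans both elements.
preg_match_all("|<[^>]+>.*</[^>]+>|", $subject, $greedy, PREG_PATTERN_ORDER);

// Ungreedy (U): quantifiers match as little as possible, so each element matches on its own.
preg_match_all("|<[^>]+>.*</[^>]+>|U", $subject, $ungreedy, PREG_PATTERN_ORDER);

// Escape the tags so a browser displays them instead of rendering them.
echo htmlspecialchars(print_r($greedy[0], true)), "\n";
echo htmlspecialchars(print_r($ungreedy[0], true)), "\n";
```

Printed this way, the tags show up inside every match, which makes the difference between one long match and two short ones easier to spot.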

EDIT:

To achieve a similar match using the ? you can do:

preg_match_all("|<[^>/]+>.*?</[^>]+>|",
    "<b>example:</b><strong>this is a test</strong>",
    $out, PREG_PATTERN_ORDER);
print_r($out);

This works because .*? keeps the content between the tags as short as possible (ungreedy), which again results in two matches.
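
One caveat worth knowing: the U modifier inverts greediness rather than simply switching it off, so a quantifier followed by ? becomes greedy again when U is set. The sketch below illustrates this with the same pattern run with and without U; the variable names are hypothetical and chosen only for this example.

```php
<?php
$subject = "<b>example:</b><strong>this is a test</strong>";

// Lazy quantifier, no modifier: .*? stops at the first closing tag, giving two matches.
$lazyCount = preg_match_all("|<[^>/]+>.*?</[^>]+>|", $subject, $lazy);

// Same pattern with U: the ? now flips .* back to greedy, giving a single match.
$flippedCount = preg_match_all("|<[^>/]+>.*?</[^>]+>|U", $subject, $flipped);

var_dump($lazyCount, $flippedCount);
```

So use either the U modifier or the ? suffix for a given quantifier, not both, unless you intend that quantifier to be greedy.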

mpaepper
  • Thank you. Can I achieve this by using ? also? – Koray Tugay Jan 23 '13 at 17:52
  • When you say greedy means try to match as much as possible, do you mean it tries to find the biggest match? Because when there is a U at the end -> it is ungreedy -> there are 2 matches. When there is no U -> it is greedy -> there is only 1 match? – Koray Tugay Jan 23 '13 at 17:55
  • @KorayTugay Greedy means take the biggest chunk, so it is able to take the whole chunk and thus only has one match. – mpaepper Jan 23 '13 at 17:56
  • Thanks. Greedy: get the biggest match; ungreedy: get smaller matches if they exist. – Koray Tugay Jan 23 '13 at 17:58
  • @KorayTugay Be careful, though, the tags are part of your match, so also part of the $out variable. I guess your system strips the tags away, but they are in the match. – mpaepper Jan 23 '13 at 18:00
  • Yes, sorry, the text is actually bold in the browser. It is just an example from somewhere. Thanks. – Koray Tugay Jan 23 '13 at 18:01
  • Sorry to bother you, can I achieve the same thing with using the symbol '?' instead of using an U at the end? @mpaepper – Koray Tugay Jan 23 '13 at 18:09
  • @KorayTugay Yes: |<[^>/]+>.*?</[^>]+>| The .*? will try to limit the content in between to be as short as possible. Therefore you will get two matches. – mpaepper Jan 23 '13 at 18:12