I am trying to match the first node in an XML document with the regex `~<(\S+).*>.*</\1>~`, but it matches nothing unless the text is below a certain length. In one document, the regex only found something after I had stripped the text down to 1186 characters. In the following example, I had to strip the text down to 960 characters before the regex was successful. As you can imagine, this seemingly inconsistent behavior is very confusing. I would appreciate any information on why this is occurring.

Original text:

<?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> </book> <book id="bk103"> <author>Corets, Eva</author> <title>Maeve Ascendant</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-11-17</publish_date> <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description> </book> <book id="bk104"> <author>Corets, Eva</author> <title>Oberon's Legacy</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-03-10</publish_date> <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description> </book> <book id="bk105"> <author>Corets, Eva</author> <title>The Sundered Grail</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-09-10</publish_date> <description>The two daughters of Maeve, half-sisters, battle one another for control of England. 
Sequel to Oberon's Legacy.</description> </book> <book id="bk106"> <author>Randall, Cynthia</author> <title>Lover Birds</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-09-02</publish_date> <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description> </book> <book id="bk107"> <author>Thurman, Paula</author> <title>Splish Splash</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-11-02</publish_date> <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description> </book> <book id="bk108"> <author>Knorr, Stefan</author> <title>Creepy Crawlies</title> <genre>Horror</genre> <price>4.95</price> <publish_date>2000-12-06</publish_date> <description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description> </book> <book id="bk109"> <author>Kress, Peter</author> <title>Paradox Lost</title> <genre>Science Fiction</genre> <price>6.95</price> <publish_date>2000-11-02</publish_date> <description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description> </book> <book id="bk110"> <author>O'Brien, Tim</author> <title>Microsoft .NET: The Programming Bible</title> <genre>Computer</genre> <price>36.95</price> <publish_date>2000-12-09</publish_date> <description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description> </book> <book id="bk111"> <author>O'Brien, Tim</author> <title>MSXML3: A Comprehensive Guide</title> <genre>Computer</genre> <price>36.95</price> <publish_date>2000-12-01</publish_date> <description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description> </book> <book id="bk112"> <author>Galos, Mike</author> <title>Visual Studio 7: A Comprehensive Guide</title> <genre>Computer</genre> <price>49.95</price> 
<publish_date>2001-04-16</publish_date> <description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description> </book> </catalog>

Trimmed (successful) text:

<?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> </book> <book id="bk103"> <author>Corets, Eva</author> <title>Maeve Ascendant</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-11-17</publish_date> <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description> </book> <book id="bk104"> <author>Core</catalog>

I apologize for the formatting of the texts, but I do not want to put something in the data to make it behave differently for others (like new line characters).

EDIT: I have been testing the regex using this site.

MirroredFate
  • 3
    Why the regex? Just use one of the 4 or 5 xml parsers that PHP has built in – nice ass Jul 15 '13 at 16:53
  • 2
    FOR SCIENCE! Look, this isn't about XML parsers, this is about PHP's regex exhibiting odd behavior. Either there is something wrong with my regex, or there is something wrong with PHP's regex implementation, or there is an option or something of which I am unaware. I just want to know why it's behaving as it is. – MirroredFate Jul 15 '13 at 16:58
  • 2
    Maybe you can try something with `pcre.backtrack_limit` (using `ini_set`)? – Pieter Jul 15 '13 at 17:03
  • 1
    I believe what Pieter is getting at, in case it's not clear, is that your expression might be so "flexible" that it needs to attempt a horrific number of paths (and backtracking) before being fully satisfied that it has _not_ found a match, and you may be encountering a limit (either a backtracking limit or a memory/time limit) before that occurs. – Andrew Cheong Jul 15 '13 at 17:05
  • That's interesting. I didn't know the backtrack_limit could be set... Although, it looks like the default is far higher than what I am using...? EDIT: That may be the case, acheong87. I tested it without the nested subpatterns and it works just fine with as many characters as I want. I shall play around with the backtrack_limit. – MirroredFate Jul 15 '13 at 17:06
  • 3
    As far as I can tell, the problem lies with the first `.*` being greedy. Try putting a `?` afterwards like `.*?` to make the regex match only up to the `>`, or else it will match up to the last `>` (meaning pretty much the entire xml document). – Jonathan Kuhn Jul 15 '13 at 17:07
  • 1
    @JonathanKuhn Been there, done that. :/ – MirroredFate Jul 15 '13 at 17:10
  • Every introductory regex tutorial about repetition explains how this works. The moment you wonder, be honest with yourself: state what you *know* and what you *guess*, and then read up on the parts you want to learn more about. – hakre Jul 15 '13 at 17:31
  • possible duplicate of [PHP: unexpected PREG\_BACKTRACK\_LIMIT\_ERROR](http://stackoverflow.com/questions/9691627/php-unexpected-preg-backtrack-limit-error) – hakre Jul 15 '13 at 17:38
  • 1
    @MirroredFate: [I've left you an answer that shows what you've done wrong](http://stackoverflow.com/a/17660389/367456). It does *not* show how to "fix" your pattern (most likely you want to use an XML parser anyway), however it contains two links to related Q&A material here on site which themselves contain more reference(s). – hakre Jul 15 '13 at 17:40

3 Answers

2

Like many other PHP functions, preg_match() has a return value.

You can use that return value to decide how the script should proceed.

In your case, you never check whether the return value is FALSE. As your example shows, it is.

According to the manual, a return value of FALSE signals an error. You can learn more about that error by calling preg_last_error(), which returns the code of the last error. For your call to preg_match(), it gives:

int(2) - PREG_BACKTRACK_LIMIT_ERROR
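
A minimal sketch of such a check follows. The helper name `try_match()` and the generated subject are made up for illustration. To keep the demonstration reproducible, it also assumes that JIT is switched off and that `pcre.backtrack_limit` is lowered on purpose; with your original data and default settings the limit is simply reached later.

```php
<?php
// Hypothetical helper: runs preg_match() and reports errors explicitly
// instead of treating FALSE like "no match".
function try_match(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        // FALSE signals an error; preg_last_error() tells which one.
        return ['status' => 'error', 'code' => preg_last_error()];
    }
    return [
        'status' => $result === 1 ? 'match' : 'no match',
        'code'   => PREG_NO_ERROR,
    ];
}

// Assumptions for a reproducible demo: no JIT, artificially low limit.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

// Made-up subject, shaped like the catalog in the question.
$subject = '<?xml version="1.0"?> <catalog> '
    . str_repeat('<book><title>t</title></book> ', 200)
    . '</catalog>';

$outcome = try_match('~<(\S+).*>.*</\1>~', $subject);
var_dump($outcome['status']);
var_dump($outcome['code'] === PREG_BACKTRACK_LIMIT_ERROR);
```

Comparing the result with `=== false` matters here, because a plain truthiness test cannot tell the error apart from an ordinary `0` (no match).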

hakre
1

You can get better control over your quantifiers by using more restrictive character classes.

Example with a lazy quantifier:

$pattern = '~<([^>\s]++)[^>]*+>.*?</\1>~';

Example with only possessive quantifiers (much better):

$pattern = '~<([^>\s]++)[^>]*+>(?>[^<]++|<(?!/\1>))+</\1>~';

But these two patterns don't deal with nested structures. To do that, you must use a recursive pattern:

$pattern = '~<([^>/\s]++)[^>]*+>(?>[^<]++|(?R))*</\1>~';
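
As a rough illustration of the difference, the sketch below runs the three patterns against a small made-up input in which a tag is nested inside a tag of the same name. The variable names and the `first_match()` helper are hypothetical.

```php
<?php
// Patterns copied from above; single quotes keep \s and \1 intact.
$lazy       = '~<([^>\s]++)[^>]*+>.*?</\1>~';
$possessive = '~<([^>\s]++)[^>]*+>(?>[^<]++|<(?!/\1>))+</\1>~';
$recursive  = '~<([^>/\s]++)[^>]*+>(?>[^<]++|(?R))*</\1>~';

// Made-up input: a <div> nested inside another <div>.
$nested = '<div>a<div>b</div>c</div>';

// Hypothetical helper: returns the full match, or null when nothing matched.
function first_match(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(first_match($lazy, $nested));       // ends at the first </div>
var_dump(first_match($possessive, $nested)); // also ends at the first </div>
var_dump(first_match($recursive, $nested));  // spans the whole outer element
```

Only the recursive pattern pairs the outer opening tag with its own closing tag. None of the three handles self-closing tags, comments or CDATA sections, which is one more reason to prefer an XML parser for real documents.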



details:

second pattern: (?>[^<]++|<(?!/\1>))+

(?>           # open an atomic group
   [^<]++     # all characters but < one or more times (possessive)
  |           # OR
   <(?!/\1>)  # < not followed by /, the content of the first capture group
              #  (the tag name here) and >
)+            # close the atomic group and repeat one or more times

The goal is to match everything up to </\1>. The idea is to consume either anything that is not a <, or a < that is not followed by /tagname>.

More information about possessive quantifiers and atomic groups.
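
A small sketch of what "possessive" and "atomic" mean in practice, using a made-up subject that is unrelated to the XML case:

```php
<?php
$subject = 'aaab';

// Greedy: a+ first takes "aaa", then gives one "a" back so "ab" can match.
$greedy = preg_match('~a+ab~', $subject);

// Possessive: a++ takes every "a" it can and never gives one back,
// so the literal "a" that follows has nothing left to match.
$possessive = preg_match('~a++ab~', $subject);

// Atomic group: (?>a+) behaves like the possessive form.
$atomic = preg_match('~(?>a+)ab~', $subject);

var_dump($greedy, $possessive, $atomic);
```

The same property that makes the possessive version fail here is what keeps the engine from retrying huge numbers of alternatives in the patterns above: once a possessive quantifier or atomic group has matched, its contents are not reconsidered.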


third pattern: the recursive pattern

<                                
  ([^>/\s]++)     # tagname, 
                  # note that you must exclude the / to avoid closing tags
  [^>]*+          # remaining characters in the tag (attributes)
>


(?>               # open an atomic group
   [^<]++         # all characters but <, one or more times (possessive)
  |               # OR
   (?R)           # recurse into the whole pattern (a nested element)
)*                # close the atomic group, repeat zero or more times

</\1>             # close tag with the first back reference
Casimir et Hippolyte
  • Could you by any chance explain a little bit of what is going on? This works beautifully, but I have no idea what is happening (at least in the second one). – MirroredFate Jul 15 '13 at 17:39
-1

Well, first of all - the general attitude is that XML should not be parsed with RegEx. Use SimpleXML instead, if possible. And as nickb has said, way too greedy...

MBaas
  • 1
    While the only use case example I have of this occurrence is using xml as text, that does not mean that it cannot happen with other texts. This isn't really an answer- it's a cop out. – MirroredFate Jul 15 '13 at 17:15
  • Yeah, sorry - I was focussed more on "getting the job done" than on the details of the regex... – MBaas Jul 15 '13 at 17:29