Alternative regex to get the contents of an XML tag

Question

I'm processing an XML file and I need to get all the content inside <section> tags.

Right now I'm using this regex:

<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $results);?>

The code inside the <section> tags is pretty complex. It includes math equations and similar markup. On my local machine the regex works perfectly. It runs PHP 5.3.10 on Apache 2.2.22 (Ubuntu).

BUT on my staging server it doesn't work. It runs PHP 5.3.3 on Apache 2.2.15 (Red Hat).

I have two questions:

Is there any issue with preg_match_all in PHP 5.3.3?

Is there a better way to express the regex?

--EDIT: VARIATIONS OF REGEX USED UNSUCCESSFULLY--

<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/is', $myXmlString, $results);?>
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>(.*?)<\/section>#ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>([^\00]*?)<\/section>#ims', $myXmlString, $results);?>

--EDIT: Why haven't I used a parser?

The XML consists of two <section> elements. Each section groups n questions for an exam.

Each question can include math equations represented by its own XML. An equation may be something like this:

<inlineequation><m:math baseline="-16.5" display="inline" overflow="scroll"><m:mrow><m:mtable columnalign="left"><m:mtr><m:mtd><m:mrow><m:mo stretchy="true">[</m:mo><m:mrow><m:mtable columnalign="right"><m:mtr><m:mtd><m:mn>4</m:mn></m:mtd><m:mtd columnalign="right"><m:mrow><m:mo>-</m:mo><m:mn>9</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mrow><m:mn>54</m:mn></m:mrow></m:mtd></m:mtr><m:mtr><m:mtd columnalign="right"><m:mrow><m:mo>&minus;</m:mo><m:mn>28</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mo>&minus;</m:mo><m:mn>1</m:mn></m:mtd><m:mtd columnalign="right"><m:mo>&minus;</m:mo><m:mn>14</m:mn></m:mtd></m:mtr></m:mtable></m:mrow><m:mo stretchy="true">]</m:mo></m:mrow></m:mtd></m:mtr></m:mtable></m:mrow></m:math></inlineequation>

I need that code to remain XML (no array) because I will pass that code as it is to a jQuery plugin which will render the equation (it will look like LaTeX equations).

If I parse the XML it will be really difficult to create the string for the equation again and locate it in the right place inside the question's statement.

Why don't you use an XML parser? Parsing XML with regex has some problems, like [sanity](http://stackoverflow.com/a/1732454/938236). – Francisco Presencia, Jan 22 '14 at 02:53
And the code at hand wouldn't work on either version due to the unescaped delimiter. – mario, Jan 22 '14 at 02:58
Also, did you bother [reading the documentation](http://es1.php.net/preg_match_all)? There's a specific point for PHP 5.3.6 which you seemed to miss. – Francisco Presencia, Jan 22 '14 at 02:58
It fails on PHP 5.3.3, not 5.3.6. My first approach was to work with a parser, but inside the sections there is a lot of code I need to keep as XML, since it will be interpreted by a jQuery plugin to render math equations. – sepelin, Jan 22 '14 at 03:18
[These things are harder than you might think at first, second, or even third glance](http://stackoverflow.com/q/4231382/471272). – tchrist, Jun 08 '14 at 20:14

score 1 · Answer 1 · answered Jan 22 '14 at 03:00


Regex can be resource intensive.

Perhaps consider using xml_parse_into_struct:

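<?php
    // Create a parser and flatten the document into two arrays:
    // $vals holds every element, $index maps tag names to positions in $vals.
    $xmlp = xml_parser_create();
    xml_parse_into_struct($xmlp, $myXmlString, $vals, $index);
    xml_parser_free($xmlp);
    print_r($vals);
?>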
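If the concern is that the equation markup has to stay a string, a parser can still help. Here is a minimal sketch using DOMDocument, which is a swapped-in alternative to xml_parse_into_struct and not part of the original suggestion. It serializes each child of a <section> back to markup with saveXML(). The sample document is hypothetical, and the sketch assumes the real XML is well-formed, declares the m: prefix, and defines any entities it uses (an undeclared &minus; would make loadXML() fail).

```php
<?php
// Hypothetical sample input; the real $myXmlString is assumed to be
// well-formed, with the m: prefix declared and no undeclared entities.
$myXmlString = <<<'XML'
<exam xmlns:m="http://www.w3.org/1998/Math/MathML">
  <section id="a"><question>Solve <inlineequation><m:math><m:mn>4</m:mn></m:math></inlineequation></question></section>
  <section id="b"><question>Second</question></section>
</exam>
XML;

$doc = new DOMDocument();
$doc->loadXML($myXmlString);

$sections = array();
foreach ($doc->getElementsByTagName('section') as $section) {
    $inner = '';
    // saveXML($node) serializes a single node back to markup, so the
    // equation XML stays a string instead of becoming an array.
    foreach ($section->childNodes as $child) {
        $inner .= $doc->saveXML($child);
    }
    $sections[] = $inner;
}

print_r($sections);
```

Each entry of $sections is the inner markup of one <section>, which could then be passed to the rendering plugin unchanged.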


flauntster


Thanks @flauntster. I edited the question to explain why I can't use the parser. – sepelin Jan 22 '14 at 05:19

score 0 · Answer 2 · answered Jan 22 '14 at 04:40

As others have said, don't use regex to parse XML. Having said that, let's answer your actual question:

Is it at all likely that your XML document contains line breaks? By default, the . character matches any character except a line break; it only matches line breaks when you explicitly enable that with a modifier.

Try this:

<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $results);?>

The extra s at the end tells the regex engine to allow . to match line breaks.
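A small sketch of the difference, using a hypothetical two-line sample string in place of the real file. It also calls preg_last_error(), since a pattern that behaves differently on two servers may be hitting a PCRE limit rather than failing to match. The PHP manual notes that the default pcre.backtrack_limit was raised from 100000 to 1000000 in PHP 5.3.7, which may be relevant when comparing 5.3.3 with 5.3.10, though that is an assumption about this case and not a confirmed diagnosis.

```php
<?php
// Hypothetical two-line sample; the real $myXmlString comes from the XML file.
$myXmlString = "<section id=\"a\">\n  <question>Q1</question>\n</section>";

// Without s, "." stops at the first line break, so nothing matches.
$withoutS = preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $a);

// With s, "." also matches line breaks and the whole body is captured.
$withS = preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $b);

var_dump($withoutS, $withS);

// preg_last_error() reports why the most recent preg_* call gave up, e.g.
// PREG_BACKTRACK_LIMIT_ERROR when pcre.backtrack_limit was exceeded.
if (preg_last_error() !== PREG_NO_ERROR) {
    echo 'PCRE error code: ', preg_last_error(), "\n";
}
```

If the error code on the staging server turns out to be PREG_BACKTRACK_LIMIT_ERROR, raising pcre.backtrack_limit with ini_set() or in php.ini is one option to try.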

Honestly though, a lot of people get too hung up on "not parsing XML with regex" without actually thinking about why it's a bad idea. Performance aside, it's essentially because there's no proper way of dealing with nested tags - there's more to it than that, but this is basically what it boils down to. XML documents are not regular so you can't use regular expressions to parse them.

HOWEVER! Sometimes the data that you want to get out of an XML document definitely IS regular. If you throw away the fact that you're dealing with XML for a moment and treat it as just a string of text - you can establish definite patterns that you ABSOLUTELY can use regex to pull out.

In your case, I'd say it's a safe bet that your XML document has a flat structure; there wouldn't be tags nested inside other tags for example. In that case, if we forget the XML component and just think about the patterns you've got:

Unmatched text
Pattern that denotes the start of a match
Matched text
Pattern that denotes the end of a match
Unmatched text
etc ...

This is absolutely regular and - save for some insane edge cases I wouldn't bother worrying about - it's pretty damned safe!
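The start/match/end sequence above can also be followed with plain string functions, which avoids regex backtracking entirely. This is only a sketch: extractSections is a hypothetical helper name, and it assumes sections are never nested inside one another and that no attribute value contains a ">" character. Like the regex, it would also accept a tag whose name merely starts with "section".

```php
<?php
// Hypothetical helper (not from the original answer): collects the text
// between each <section ...> start tag and the next </section>.
// Assumes sections are never nested inside one another.
function extractSections($xml)
{
    $results = array();
    $offset = 0;
    while (($start = stripos($xml, '<section', $offset)) !== false) {
        // End of the start tag: first ">" after "<section".
        $open = strpos($xml, '>', $start);
        if ($open === false) {
            break;
        }
        // Start of the end tag: first "</section>" after the start tag.
        $close = stripos($xml, '</section>', $open);
        if ($close === false) {
            break;
        }
        $results[] = substr($xml, $open + 1, $close - $open - 1);
        // Continue scanning after this section's end tag.
        $offset = $close + strlen('</section>');
    }
    return $results;
}

$sample = "<section id=\"a\">\n<q>1</q>\n</section><section>two</section>";
print_r(extractSections($sample));
```

The captured strings are returned untouched, so any equation markup inside a section stays exactly as it was in the source.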

Thanks man. I edited the question to include the variations of the regex I have already tried and why I need to use regular expressions instead of a parser. – sepelin, Jan 22 '14 at 05:21
