0

I'm looking for a way to dynamically surround parts of text with XML nodes based on regular expressions.

Consider the following example:

<speak>The test number is 123456789, and some further block of text.</speak>

Now let's say I have a regular expression that targets the number, and I want to surround each match with a new tag so that it becomes:

<speak>The test number is <say-as interpret-as="characters">123456789</say-as>, and some further block of text.</speak>

I thought about using DOMDocument to create the tags, but I'm not sure about the substitution part. Any advice?

Nicolas

4 Answers

1

DOM is the correct way. It lets you find and traverse the text nodes. Run the regex against the content of each text node and build the replacement nodes as a document fragment.

function wrapMatches(\DOMNode $node, string $pattern, string $tagName, array $tagAttributes = []): void {
    $document = $node instanceof \DOMDocument ? $node : $node->ownerDocument;
    $xpath = new \DOMXPath($document);
    // iterate all descendant text nodes
    foreach ($xpath->evaluate('.//text()', $node) as $textNode) {
        $content = $textNode->textContent;
        $found = preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);
        $offset = 0;
        if ($found) {
            // a fragment lets us insert several nodes as if they were one
            $fragment = $document->createDocumentFragment();
            foreach ($matches[0] as $match) {
                // PREG_OFFSET_CAPTURE yields [matched text, byte offset]
                [$matchContent, $matchStart] = $match;
                // add the text between the previous match and the current one
                $fragment->appendChild(
                    $document->createTextNode(substr($content, $offset, $matchStart - $offset))
                );
                // add wrapper element, ...
                $wrapper = $fragment->appendChild($document->createElement($tagName));
                // ... set its attributes ...
                foreach ($tagAttributes as $attributeName => $attributeValue) {
                    $wrapper->setAttribute($attributeName, $attributeValue);
                }
                // ... and add the text content
                $wrapper->textContent = $matchContent;
                $offset = $matchStart + strlen($matchContent);
            }
            // add text after last match
            $fragment->appendChild($document->createTextNode(substr($content, $offset)));
            // replace the text node with the new fragment
            $textNode->parentNode->replaceChild($fragment, $textNode);
        }
    }
}


$xml = <<<'XML'
<speak>The test number is 123456789, and some further block of text.</speak>
XML;

$document = new DOMDocument();
$document->loadXML($xml);

wrapMatches($document, '/\d+/u', 'say-as', ['interpret-as' => 'characters']);

echo $document->saveXML();
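
As a small illustration of why working on text nodes keeps attributes safe, the sketch below lists what `.//text()` actually selects. The sample input, with an `id` attribute and a nested `b` element, is made up for this example and does not come from the question.

```php
<?php
// Hypothetical input: an id attribute and a nested element, both containing digits.
$document = new DOMDocument();
$document->loadXML('<speak id="12">The test number is <b>123</b>.</speak>');

$xpath = new DOMXPath($document);
$texts = [];
// the same XPath expression as in wrapMatches(); it selects text nodes only
foreach ($xpath->evaluate('.//text()', $document) as $textNode) {
    $texts[] = $textNode->textContent;
}

// the attribute value "12" is not a text node, so it never reaches the regex
print_r($texts);
```

Only the three text fragments are listed, so a pattern such as `\d+` would be applied to `123` but never to the `id` value.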
ThW
  • Looking at the complexity of that, I'm probably better off using preg_replace and a simple method for constructing the tag as a string. So perhaps the first one is the more elegant solution after all. – Nicolas Aug 16 '19 at 12:42
  • A simple regex does not take XML structure into consideration. It might work for some specific cases, but it is really fragile. – ThW Aug 16 '19 at 21:51
  • DOM is a very low-level way of doing this kind of transformation. XSLT is generally far less code. – Michael Kay Aug 17 '19 at 16:37
  • It is transforming text to XML. XSLT 1.0 has only limited features for this (XSLT 2.0 requires something like Saxon/C). I am actually playing around with adding XPath 2.0 functions to PHP's `ext/xsl`: https://github.com/ThomasWeinert/XSLT-Functions/blob/master/examples/regexp/wrap-matches.php – ThW Aug 18 '19 at 20:10
1

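This is conveniently handled using the xsl:analyze-string instruction in XSLT 2.0. For example, you can define the rule below. Note that text which does not match the regex is only kept if the rule also contains an xsl:non-matching-substring branch:

<xsl:template match="speak">
  <xsl:copy>
    <xsl:analyze-string select="." regex="\d+">
      <xsl:matching-substring>
        <say-as interpret-as="characters">
          <xsl:value-of select="."/>
        </say-as>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:copy>
</xsl:template>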
Michael Kay
0

You can use preg_replace, something like this:

$str = '<speak>The test number is 123456789, and some further block of text.</speak>';
echo preg_replace('/(\d+)/','<say-as interpret-as="characters">$1</say-as>',$str);

and the output would be:

<speak>The test number is <say-as interpret-as="characters">123456789</say-as>, and some further block of text.</speak>
Saeed M.
  • I was hoping for a more elegant solution where I could use some type of xml generator for creating the tags and attributes, otherwise I'd have to create a whole bunch of additional logic for handling that manually. The above is just an example, the actual replacement rules are stored in database and could have different attributes. – Nicolas Aug 16 '19 at 09:34
  • If you change the input to be something like `$str = '<speak id="12">The test...`, then the replacement will also hit the id attribute: `<speak id="<say-as interpret-as="characters">12</say-as>">The test...` – Nigel Ren Aug 17 '19 at 18:46
  • @NigelRen It's up to the regex to be formed correctly and not let the replacement go rogue. Secondly, the tags can actually be added at the very end, which is how I currently do it so they're not present in the raw text being processed. – Nicolas Aug 19 '19 at 09:05
  • @Nicolas, that was just an example of how the solution could go wrong. Saying it works for one example is OK, but in 6 months' time, when other developers start working with it, they might not be aware of the pitfalls. You also have to remember that these answers are not just for your personal use, but something that others may use in their own situations. – Nigel Ren Aug 19 '19 at 09:18
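
The pitfall discussed in these comments can be reproduced with a short sketch. The `id="12"` attribute is an assumed input for illustration, and the `DOMDocument::loadXML` call is only there to show that the result is no longer well-formed XML.

```php
<?php
// Assumed input: the sentence from the question, plus an id attribute.
$str = '<speak id="12">The test number is 123456789.</speak>';
$result = preg_replace('/(\d+)/', '<say-as interpret-as="characters">$1</say-as>', $str);

// the digits inside the attribute value are wrapped as well
echo $result, "\n";

// collect parser errors instead of emitting warnings
libxml_use_internal_errors(true);
$check = new DOMDocument();
$wellFormed = $check->loadXML($result);
libxml_clear_errors();

// false: a "<" inside an attribute value is not allowed in XML
var_dump($wellFormed);
```

A stricter pattern can avoid this particular case, but the regex itself has no notion of where markup ends and text begins.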
0

I ended up doing it the simple way, since I don't need to handle nested nodes and other XML-specific stuff. So I just made a simple method for creating the tags as strings. It's good enough.

protected function createTag($name, $attributes = [], $content = null)
{
    $openingTag = '<' . $name;

    foreach ($attributes as $attribute => $value) {
        // escape the value so quotes or ampersands cannot break the markup
        $openingTag .= sprintf(' %s="%s"', $attribute, htmlspecialchars((string) $value, ENT_QUOTES | ENT_XML1));
    }

    $openingTag .= '>';
    $closingTag = '</' . $name . '>';

    // default to the first capture group of the regex
    $content = $content ?: '$1';

    return $openingTag . $content . $closingTag;
}

// elsewhere in the same class:
$tag = $this->createTag($tagName, $attributes);

$text = preg_replace($regex, $tag, $text);
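
If the attribute values come from a database, one way to keep the string-based approach while letting DOM do the escaping is sketched below. `buildTag` is a hypothetical helper and not part of the code above. It is combined with `preg_replace_callback` so that a `$` or `\` in an attribute value is not read as a backreference. Like the rest of this answer, it assumes the surrounding text contains no markup.

```php
<?php
// Hypothetical helper: build one element with DOM and return it as a string.
function buildTag(string $name, array $attributes, string $content): string
{
    $document = new DOMDocument();
    $element = $document->createElement($name);
    foreach ($attributes as $attribute => $value) {
        // DOM escapes attribute values when the node is serialized
        $element->setAttribute($attribute, (string) $value);
    }
    // a text node escapes the matched content as well
    $element->appendChild($document->createTextNode($content));

    // passing a node to saveXML() serializes only that node, without the XML declaration
    return $document->saveXML($element);
}

$text = 'The test number is 123456789, and some further block of text.';
$result = preg_replace_callback(
    '/\d+/',
    function (array $match): string {
        return buildTag('say-as', ['interpret-as' => 'characters'], $match[0]);
    },
    $text
);

echo $result, "\n";
```

The tag name, attributes, and pattern are hard-coded here for brevity; in practice they would come from wherever the replacement rules are stored.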

Nicolas