5

I have an HTML string that I'm trying to extract list items from. I'd like to extract the text and any subnodes. However, DOMDocument converts each entity to its character instead of leaving it in its original form.

I've tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. It should be noted that I'm running on Win7 with PHP 5.2.17.

Example code is:

$example = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li></ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));
    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;
}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }

    return $innerHTML;
}

&frac12; ends up converted to ½ (the single UTF-8 character, not the entity), which is not the format I want.

Reuben
  • How are you determining the conversion took place? Are you displaying the results in HTML? – Phil Sep 08 '11 at 05:06
  • With an echo (the real code is a bit more complicated). I'll update the example code with the echos that I'm using at the moment. The echo'd results are being output to a log file. Results are being displayed in Textpad (like Notepad), and not HTML. – Reuben Sep 08 '11 at 05:10
  • How are you loading the `$example` string into the `DOMDocument`? – Phil Sep 08 '11 at 05:24
  • 5.3.6 - http://www.php.net/manual/en/domdocument.savehtml.php (this supports `$doc->saveHTML( new DOMNode('½') );`) – ajreal Sep 08 '11 at 05:36
  • @Phil. There's something to be said for making sure example code actually works before putting it up. But it actually works. – Reuben Sep 08 '11 at 05:48
  • @ajreal, I was hoping to avoid upgrading PHP just for that feature. I guess the workaround for PHP 5.2.X is to use saveHTMLFile, then load the file and strip the DOCTYPE. Nasty. – Reuben Sep 08 '11 at 05:51
  • @ajreal I tried `saveHTML(DOMNode $node)` in 5.3.8 and it still translates the entity. – Phil Sep 08 '11 at 05:53
  • Sorry, I don't have 5.3.6+ to test. How about `$doc->saveHTML( new DOMText('½') )` – ajreal Sep 08 '11 at 05:55

3 Answers

5

Solution for PHP versions before 5.3.6:

$html =<<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('li') as $node)
{
  echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)), "\n";
}
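
A possible variation on the loop above, not part of the original answer: `htmlentities()` accepts a charset argument, so the UTF-8 text that DOMDocument returns can be encoded directly, without the `iconv()` round trip through ISO-8859-1. This is only a sketch, assuming the input uses entities as in the example; the `$items` array is a name introduced here for illustration. Like the original, it reads `nodeValue`, so child tags are still dropped.

```php
<?php
$html = '<ul><li>text</li><li>&frac12; of this is <strong>strong</strong></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// $items is a hypothetical holder for the re-encoded text of each <li>.
$items = array();
foreach ($doc->getElementsByTagName('li') as $node) {
    // nodeValue is UTF-8; tell htmlentities() so it can map the
    // character back to its named entity.
    $items[] = htmlentities($node->nodeValue, ENT_QUOTES, 'UTF-8');
}

echo implode("\n", $items), "\n";
```

Passing the charset explicitly also avoids depending on the default charset of `htmlentities()`, which differs between PHP versions.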
ajreal
  • It treats ½ correctly, but strips the `<strong>` tags. I might try something where _get_inner_html() recognises the difference between DOMElement and DOMText, and uses an appropriate function to convert (either htmlentities or a recursive call). – Reuben Sep 08 '11 at 06:17
3

Based on the answer provided by ajreal, I've expanded the example variable to handle more cases, and changed _get_inner_html() to make recursive calls and handle the entity conversion for text nodes.

It's probably not the best answer, since it makes some assumptions about the elements (such as no attributes). But since my particular needs don't require attributes to be carried across (yet.. I'm sure my sample data will throw that one at me later on), this solution works for me.

$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li>'.
'<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
'<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
'</ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = true;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));

    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;

}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        echo 'Node type is '.$child->nodeType.PHP_EOL;
        switch ($child->nodeType) {
        case XML_TEXT_NODE: // text node (nodeType 3): re-encode characters as entities
            $innerHTML .= htmlentities(iconv('UTF-8', 'ISO-8859-1', $child->nodeValue));
            break;
        default:
            echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
            echo 'Node name '.$child->nodeName.PHP_EOL;
            $innerHTML .= '<'.$child->nodeName.'>';
            $innerHTML .= _get_inner_html( $child );
            $innerHTML .= '</'.$child->nodeName.'>';
            break;
        }
    }

    return $innerHTML;
}
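
The no-attributes assumption mentioned above could be lifted roughly as follows. This is a sketch rather than the code I actually ran: `inner_html_with_entities()` is a hypothetical name, it passes the charset to `htmlentities()` instead of calling `iconv()`, and it still writes a closing tag for every element, so void elements such as `<br>` would need special handling.

```php
<?php
// Hypothetical variant of _get_inner_html() that keeps attributes.
function inner_html_with_entities($node) {
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // Text is UTF-8 inside the DOM; convert it back to named entities.
            $out .= htmlentities($child->nodeValue, ENT_QUOTES, 'UTF-8');
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            // Rebuild the attribute list, escaping each value.
            $attrs = '';
            foreach ($child->attributes as $attr) {
                $attrs .= ' ' . $attr->nodeName . '="'
                        . htmlentities($attr->nodeValue, ENT_QUOTES, 'UTF-8') . '"';
            }
            $out .= '<' . $child->nodeName . $attrs . '>'
                  . inner_html_with_entities($child)
                  . '</' . $child->nodeName . '>';
        }
        // Other node types (comments and so on) are skipped in this sketch.
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Nested <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li></ul>');
$li = $doc->getElementsByTagName('li')->item(0);
$result = inner_html_with_entities($li);
echo $result, PHP_EOL;
```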
Reuben
  • Use ISO-8859-1//TRANSLIT or ISO-8859-1//IGNORE as the target charset to avoid notices and a truncated string when a character can't be converted. For example, the presence of `™` resulted in a notice, and it was converted to `TM` with the //TRANSLIT option. – Reuben Sep 09 '11 at 02:53
-1

There is no need to iterate over the child nodes:

function innerHTML($node)
{
    // Serialize the node itself, then strip its own opening and closing tags.
    $html = $node->ownerDocument->saveXML($node);
    return preg_replace("%^<{$node->nodeName}[^>]*>|</{$node->nodeName}>$%", '', $html);
}
diyism
  • What replaces the htmlentities(iconv()) call in this example? It looks like it only strips the outer tag. – Reuben Jul 16 '12 at 05:07