
I am trying to load a simple HTML string into DOMDocument, but whether or not I run it through HTML Tidy first, I cannot access any of its nodes.

Here is the instantiation:

    $doc = new DOMDocument(/*'1.0', 'utf-8'*/);
    $doc->recover = true;
    $doc->strictErrorChecking = false;
    $doc->formatOutput = true;
    $doc->load($content);

    $node_array = $doc->getElementsByTagName("body");
    print_r($node_array);

...or $node_array->items(0);

I get:

DOMNodeList Object
(
)

DOMDocument returns the string just fine with the save function. It is not a resource. Could it be missing dependencies or additional PHP configuration?

Update: The DOMDocument objects simply don't implement any string conversion:

    print_r( (string)$node_array );

Object of class DOMNodeList could not be converted to string in....


The HTML code is here: http://pastebin.com/11V92Dup (intentionally malformed, to demonstrate that 'tidy' properly closes the tags)

I would like to simply walk the nodes and output their content:

    $node_array = $doc->getElementsByTagName("html");//parent_node();
    $x = $doc->documentElement;
    foreach ($x->childNodes AS $item)
      {
      print $item->nodeName . " = " . $item->nodeValue . "<br />";
      }

UPDATE 2: I get this result, which doesn't make sense. Where does all the whitespace come from?

 body = 







                  COMPOUND: C05441
  • Sorry, but what exactly is the question? Do you want to get the whole body as a string? If so, and you want to do it with DOMDocument, you must first copy (import) the node into a new DOMDocument, like this: $node_arr = $doc->getElementsByTagName('body'); if ($node_arr->length) { $new_dom = new DOMDocument; $new_dom->appendChild($new_dom->importNode($node_arr->item(0), true)); }. But I'd advise using substring/strpos or a regexp instead. – ZigZag Sep 12 '11 at 12:10
  • The whitespace is caused by the HTML tags under the body tag. What are you looking for? – ajreal Sep 12 '11 at 15:50
  • What sense would there be in that? (PS: thanks for answering; some voted down without actually helping out.) For instance, the getter property childNodes never seems to change the internal pointer? – Lorenz Lo Sauer Sep 12 '11 at 15:56

1 Answer

I'm not quite clear on what you're expecting for an answer. I'll give it a try anyway. Here's some code that recursively iterates over your HTML tree and outputs the textContent value of each element.

<?php

$contents = <<<HTML
<html><head>
<title>KEGG COMPOUND: C05441</title>
<link type="text/css" rel="stylesheet" href="/css/gn2.css">
<link rel="stylesheet" href="/css/bget.css" type="text/css">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="Content-Script-Type" content="text/javascript">
</head>
<body onload="window.focus();init();" bgcolor="#ffffff">
<table border=0 cellpadding=0 cellspacing=0><tr><td>
<table border="0" cellspacing="0" cellpadding="0" width="100%"><tr><td width="70"><a href="/kegg/kegg2.html"><img align="middle" border="0" src="/Fig/bget/kegg2.gif" alt="KEGG"></a></td><td>&nbsp;&nbsp;&nbsp;</td><td><a name="compound:C05441"></a><font class="title2">COMPOUND: C05441</font></td><td align="right" valign="bottom"><a href="javascript:void(window.open('/kegg/document/help_bget_compound.html','KEGG_Help','toolbar=no,location=no,directories=no,width=720,height=640,resizable=yes,scrollbars=yes'))"><img onmouseup="btn(this,'Hb')" align="middle" onmouseout="btn(this,'Hb')" onmousedown="btn(this,'Hbd')" onmouseover="btn(this,'Hbh')" alt="Help" name="help" border="0" src="/Fig/bget/button_Hb.gif"></a></td></tr></table>
<form method="post" action="/dbget-bin/www_bget" enctype="application/x-www-form-urlencoded" name="form1">
<table border=0 cellpadding=1 cellspacing=0>
<tr>
<td class="fr2">
<table border=0 cellpadding=2 cellspacing=0 style="border-bottom:#000 1px solid">

</table>
</body></html>
HTML;

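// The sample markup is malformed, so collect libxml warnings internally;
// warnings printed before header() could otherwise stop the header from being sent.
libxml_use_internal_errors(true);

$doc = new DOMDocument("1.0", "UTF-8");
$doc->loadHTML($contents);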

header("Content-Type: text/plain; charset=utf-8");

function recursivelyEchoChildNodes (DOMElement $parent, $depth = 1) {
    foreach ($parent->childNodes as $node) {
        if ($node instanceof DOMElement) {
            echo str_repeat("-", $depth) . " " . $node->localName . " = " . $node->textContent . "\n";
            if ($node->hasChildNodes()) {
                recursivelyEchoChildNodes($node, $depth + 1);
            }
        }
    }
}

$html = $doc->getElementsByTagName("html")->item(0);
recursivelyEchoChildNodes($html);
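
Two details may explain the behaviour in the question. `load()` expects a file path, whereas `loadHTML()` parses a string, which is why the code above uses `loadHTML()`. And `nodeValue` on an element returns the text of all its descendants, including the whitespace between tags. The following is only a minimal sketch under those assumptions: the sample markup and the `collectText` helper are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical sample markup; the unclosed <td> mimics malformed input.
$content = "<html><body>\n\n  <table><tr><td>COMPOUND: C05441</table>\n\n</body></html>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// loadHTML() parses a string; load() would treat $content as a file path.
$doc->loadHTML($content);
libxml_clear_errors();

$bodies = $doc->getElementsByTagName("body");
echo "body elements found: " . $bodies->length . "\n";

// Hypothetical helper: gather trimmed, non-empty text nodes beneath $node.
function collectText(DOMNode $node, array &$out): void {
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text = trim($child->nodeValue);
            if ($text !== "") {
                $out[] = $text; // keep only text that is not pure whitespace
            }
        } elseif ($child->hasChildNodes()) {
            collectText($child, $out); // descend into elements
        }
    }
}

$texts = [];
collectText($bodies->item(0), $texts);
print_r($texts);
```

Skipping whitespace-only text nodes keeps the output free of the blank lines shown in the question, while the element structure itself is left untouched.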