The best way to work with HTML documents is to use a parser. In these examples I will use the built-in DOMDocument class.
First of all, you have to initialize DOMDocument and load the HTML string:
$dom = new DOMDocument();
libxml_use_internal_errors( true );   // collect parse errors instead of emitting warnings
$dom->loadHTML( $html );
libxml_use_internal_errors( false );  // switch back to normal error reporting
I use ->loadHTML() to load a string, but if your original HTML is in a file, you can load it directly with:
$dom->loadHTMLFile( $yourFilePath );
To avoid annoying warnings about invalid HTML syntax, I call libxml_use_internal_errors( true ) before loading: parse errors are then collected internally instead of being emitted as warnings.
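If you want to see what the parser complained about rather than just silence it, the collected errors can be read back. This is a minimal sketch, assuming a made-up $html sample string; it uses libxml_get_errors() and libxml_clear_errors(), and restores whatever error setting was active before:

```php
<?php
// Sample input (an assumption for illustration): the <p> tag is never closed.
$html = '<div><section>One</section><p>Unclosed paragraph</div>';

// Remember the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with message, line, column, etc.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// Empty the internal error buffer and restore the earlier setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Which messages appear (if any) depends on the libxml version PHP was built against, so treat the printed list as diagnostic output only.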
Example 1: Delete all nodes with the ‘section’ tag:
$nodes = $dom->getElementsByTagName( 'section' );
while( $nodes->length )
{
    $nodes->item(0)->parentNode->removeChild( $nodes->item(0) );
}
With ->getElementsByTagName( 'section' ) I get all of the document's nodes with the section tag; then, in the while loop, I delete each node. Note that I use while instead of foreach because the returned node list is live: if there are two section nodes, for example, then when I delete the first node the second one becomes the first, and a foreach loop would skip it on its next iteration. As an alternative, I can use a decrementing for loop.
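The decrementing for loop mentioned above could look like the following sketch. The sample markup is an assumption for illustration; iterating from the last index to the first means a removal never shifts the nodes that are still to be visited:

```php
<?php
// Sample input (an assumption for illustration).
$html = '<html><body><section>A</section><p>Keep</p><section>B</section></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$nodes = $dom->getElementsByTagName('section');

// Walk backwards: removing item $i only affects indexes above $i,
// which have already been processed.
for ($i = $nodes->length - 1; $i >= 0; $i--) {
    $node = $nodes->item($i);
    $node->parentNode->removeChild($node);
}

echo $dom->saveHTML();
```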
Example 2: Delete node by ID:
if( $node = $dom->getElementById( 'footer-widget-wysija-1' ) )
{
    $node->parentNode->removeChild( $node );
}
An ID is unique by definition, so ->getElementById() returns at most one element (or null when nothing matches): if the element is found, I can delete it using ->removeChild().
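If you delete by ID in several places, the check-then-remove pattern can be wrapped in a small function. This is only a sketch: removeById() is a hypothetical helper name, not part of DOMDocument, and the sample markup and IDs are made up for the example:

```php
<?php
// Hypothetical helper (not a DOMDocument method): removes the element with
// the given ID and reports whether anything was actually removed.
function removeById(DOMDocument $dom, string $id): bool
{
    $node = $dom->getElementById($id);
    if ($node === null || $node->parentNode === null) {
        return false; // no such element, or it is not attached to a parent
    }
    $node->parentNode->removeChild($node);
    return true;
}

// Sample input (an assumption for illustration).
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<div id="content">Text</div><div id="footer-widget">Footer</div>');
libxml_clear_errors();
libxml_use_internal_errors(false);

$removed = removeById($dom, 'footer-widget'); // element exists
$missing = removeById($dom, 'no-such-id');    // element does not exist

var_dump($removed, $missing);
```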
Output HTML:
Finally, to output the resulting HTML, use:
echo $dom->saveHTML();
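Note that ->loadHTML() wraps a bare fragment in a doctype plus html and body elements, and ->saveHTML() with no argument prints all of that. If you only want part of the document back, ->saveHTML() also accepts a node. A minimal sketch, assuming a made-up fragment with the ID main:

```php
<?php
// Sample input (an assumption for illustration): a fragment, not a full page.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<div id="main"><p>Hello</p></div>');
libxml_clear_errors();
libxml_use_internal_errors(false);

// Whole document, including the doctype and the html/body wrappers.
$full = $dom->saveHTML();

// Only the chosen element and its descendants.
$main = $dom->getElementById('main');
$fragment = $dom->saveHTML($main);

echo $fragment, "\n";
```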