I am converting Word docs to HTML on the fly and need to split the resulting HTML into sections based on a delimiter. For example:
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
<p>
<span>More content in section 2</span>
<p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
<!-- This continues on... -->
Should be parsed as:
Section 1:
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
Section 2:
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
<p>
<span>More content in section 2</span>
<p></p>
<div>
Section 3:
<div id="div2">
<p>
<b>
</b>
<p>
<p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
I can't simply "explode"/slice the string on the delimiter, because that would break the HTML: every bit of text content is nested inside several parent elements.
I have no control over the HTML structure and it sometimes changes based on the structure of the Word doc. An end user will import their Word doc to be parsed in the application, so the resulting HTML will not be altered before being parsed.
Often the content is at different depths in the HTML.
I cannot rely on element classes or IDs because they are not consistent from doc to doc. #div1, #div2, and #div3 are just for illustration in my example.
My goal is to parse out the content, so if there are empty elements left over, that's OK. I can simply run over the markup again and remove empty tags (p, font, b, etc.).
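To illustrate that cleanup pass, here is a minimal sketch. The function name `removeEmptyElements` and the `$keep` list of tags are my own assumptions, not part of any library. It walks the tree bottom-up and drops every element that has no non-whitespace text and no remaining child elements. Note that `trim()` does not strip non-breaking spaces, which Word-generated HTML may contain.

```php
<?php
// Hypothetical helper: remove elements that contain neither text nor kept elements.
// $keep lists tags that are meaningful even when empty (an assumed default).
function removeEmptyElements(DOMNode $node, array $keep = ['img', 'br', 'hr']): void
{
    // Copy the live child list first; removing nodes while iterating it would skip siblings.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if (!$child instanceof DOMElement) {
            continue;
        }
        // Clean the subtree first, so nested empty wrappers collapse bottom-up.
        removeEmptyElements($child, $keep);

        $isKeptTag   = in_array(strtolower($child->nodeName), $keep, true);
        $hasText     = trim($child->textContent) !== '';
        // Any element still present below $child survived the recursive pass.
        $hasElements = $child->getElementsByTagName('*')->length > 0;

        if (!$isKeptTag && !$hasText && !$hasElements) {
            $node->removeChild($child);
        }
    }
}
```

It would be called on the body (or on each extracted section) after the split, for example `removeEmptyElements($body);`.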
My attempts:
I am using the PHP DOM extension to parse the HTML and loop through the nodes, but I cannot come up with a good algorithm for splitting it.
$doc = new \DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    if ($child->hasChildNodes()) {
        // Do recursive call...
    } else {
        // Contains slide identifier?
    }
}
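One direction I can imagine, sketched under some assumptions: number every text node in document order, start a new section at each text node that contains the delimiter, and then, for each section, clone the whole body and delete the text nodes that belong to other sections. The element skeleton survives in every clone, so the markup stays nested correctly and only empty wrappers are left behind. The names `splitByDelimiter` and `collectTextNodes` are hypothetical, and the sketch assumes the delimiter sits at the start of its text node, that text before the first delimiter can be dropped, and that the HTML has a non-empty body. Non-text content such as images is not assigned to a section and would appear in every clone.

```php
<?php
// Hypothetical helper: collect all text nodes below $node in document order.
function collectTextNodes(DOMNode $node, array &$out): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $out[] = $child;
        } else {
            collectTextNodes($child, $out);
        }
    }
}

// Hypothetical helper: return one HTML string per delimiter-started section.
function splitByDelimiter(string $html, string $delimiter): array
{
    libxml_use_internal_errors(true);   // Word-generated HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $body = $doc->getElementsByTagName('body')->item(0);

    // Pass 1: map each text node index to a section index.
    $texts = [];
    collectTextNodes($body, $texts);
    $sectionOf = [];
    $current = -1;                      // -1 means "before the first delimiter"
    foreach ($texts as $i => $text) {
        if (strpos($text->nodeValue, $delimiter) !== false) {
            $current++;
        }
        $sectionOf[$i] = $current;
    }

    // Pass 2: for each section, clone the body and strip foreign text nodes.
    $sections = [];
    for ($k = 0; $k <= $current; $k++) {
        $clone = $body->cloneNode(true);
        $cloneTexts = [];
        collectTextNodes($clone, $cloneTexts);   // same order as in the original
        foreach ($cloneTexts as $i => $text) {
            if ($sectionOf[$i] !== $k) {
                $text->parentNode->removeChild($text);
            }
        }
        // Serialize the clone's children, i.e. the body's inner HTML.
        $out = '';
        foreach ($clone->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
        $sections[] = $out;
    }
    return $sections;
}
```

Called as `splitByDelimiter($html, '[[delimiter]]')`, this would return one string per section, each still carrying the empty ancestors of the other sections. Cloning the full body once per section is wasteful for large documents, but it avoids having to reason about where each section's ancestors open and close.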