Slicing HTML based on delimiter

Question

I am converting Word docs to HTML on the fly and need to split that HTML into sections based on a delimiter. For example:

<div id="div1">
    <p>
        <font>
            <b>[[delimiter]]Start of content section 1.</b>
        </font>
    </p>
    <p>
        <span>More content in section 1</span>
    </p>
</div>
<div id="div2">
    <p>
        <b>
            <font>[[delimiter]]Start of section 2</font>
        </b>
    <p>
    <span>More content in section 2</span>
    <p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
    <span><font>More content in section 3</font></span>
</div>
<!-- This continues on... -->

Should be parsed as:

Section 1:

<div id="div1">
    <p>
        <font>
            <b>[[delimiter]]Start of content section 1.</b>
        </font>
    </p>
    <p>
        <span>More content in section 1</span>
    </p>
</div>

Section 2:

<div id="div2">
    <p>
        <b>
            <font>[[delimiter]]Start of section 2</font>
        </b>
    <p>
    <span>More content in section 2</span>
    <p></p>
<div>

Section 3:

<div id="div2">
    <p>
        <b>

        </b>
    <p>
    <p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
    <span><font>More content in section 3</font></span>
</div>

I can't simply "explode"/slice based on the delimiter, because that would break the HTML. Every bit of text content has many parent elements.
I have no control over the HTML structure and it sometimes changes based on the structure of the Word doc. An end user will import their Word doc to be parsed in the application, so the resulting HTML will not be altered before being parsed.
Often the content is at different depths in the HTML.
I cannot rely on element classes or IDs because they are not consistent from doc to doc. #div1, #div2, and #div3 are just for illustration in my example.
My goal is to parse out the content, so if there are empty elements left over, that's OK; I can simply run over the markup again and remove empty tags (p, font, b, etc.).

My attempts:

I am using the PHP DOM extension to parse the HTML and loop through the nodes. But I cannot come up with a good algorithm to figure this out.

$doc = new \DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);

foreach ($body->childNodes as $child) {
    if ($child->hasChildNodes()) {
        // Do recursive call...
    } else {
        // Contains slide identifier?
    }
}

I don't think this is achievable unless you have certain `div`s that you can target, for example by ID. If you could rely on that, it would be simple to get everything between the opening and closing tag of a certain ID (e.g. `#div1`, `#div2`, etc.), and that is the content you want. However, you can't just look for *any* `div`, because that's a universal tag which may even appear inside other `div`s. You always have to define rules for sections of content, which is impossible if you neither control the markup nor can rely on it never changing. — Andy, Aug 22 '17 at 15:59
Why not use `strip_tags` and then output the text into some template? — vadim_hr, Aug 29 '17 at 13:44
Why not use something like this library: https://github.com/ATofighi/phpQuery — online Thomas, Aug 31 '17 at 07:50

Hugo Delsing · Answer 1 · 2017-08-31T07:36:22.207

In order to solve an issue like this, you first need to work out the steps needed to get a solution, before even starting to code.

1. Find an element that starts with [[delimiter]].
2. Check if its parent has a next sibling.
3. No? Repeat step 2 one level up, using the parent's parent.
4. Yes? This next sibling contains the content.
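The steps above can be sketched with PHP's built-in DOM extension as well. This is only a minimal illustration, not the code from this answer: `findNextContent()` is a hypothetical helper name, the sample markup is shortened, and it assumes the delimiter sits at the start of a text node.

```php
<?php
// Sketch only: findNextContent() is a hypothetical helper, not part of the classes in this answer.

/**
 * Climb from $node towards the root until some ancestor has a following
 * element sibling, and return that sibling (or null if there is none).
 */
function findNextContent(DOMNode $node): ?DOMElement
{
    for ($current = $node->parentNode; $current !== null; $current = $current->parentNode) {
        // Skip whitespace text nodes and comments; only element siblings count as content.
        for ($sibling = $current->nextSibling; $sibling !== null; $sibling = $sibling->nextSibling) {
            if ($sibling instanceof DOMElement) {
                return $sibling;
            }
        }
    }
    return null;
}

// Shortened sample markup (assumption: mirrors section 1 of the question).
$doc = new DOMDocument();
$doc->loadHTML('<div id="a"><p><font><b>[[delimiter]]Start</b></font></p><p><span>More</span></p></div>');

// Step 1: find the first text node that starts with the delimiter.
$xpath = new DOMXPath($doc);
$start = $xpath->query("//text()[starts-with(normalize-space(.), '[[delimiter]]')]")->item(0);

// Steps 2 to 4: walk up until a next sibling is found.
$next = findNextContent($start);
echo $doc->saveHTML($next), PHP_EOL;
```

Here the climb passes `<b>`, `<font>` and the first `<p>` before it reaches a sibling, which is the second `<p>`.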

Once you have this working, you are already 90% done. All you need to do is clean up the unnecessary tags.
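The cleanup pass could look roughly like the following. It is a sketch using the built-in DOM extension, with a hypothetical helper name `removeEmptyElements()`; it assumes that "empty" means no child elements and no non-whitespace text, and that `br`, `img` and `hr` should be kept.

```php
<?php
// Sketch only: removeEmptyElements() is a hypothetical helper name.

function removeEmptyElements(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    // Elements inside <body> with no element children and no non-whitespace text.
    // Assumption: br/img/hr are meaningful even though they are empty.
    // Note: a non-breaking space counts as text here, so <p>&nbsp;</p> is kept.
    $query = "//body//*[not(*) and not(normalize-space(.)) and not(self::br or self::img or self::hr)]";

    // Removing a leaf can leave its parent empty, so repeat until nothing matches.
    while (($empty = $xpath->query($query)) && $empty->length > 0) {
        // Copy the list to an array before removing nodes from the tree.
        foreach (iterator_to_array($empty) as $element) {
            $element->parentNode->removeChild($element);
        }
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p><b> </b></p><p><span>Keep</span></p><p></p></div>');
removeEmptyElements($doc);

$body = $doc->getElementsByTagName('body')->item(0);
echo $doc->saveHTML($body), PHP_EOL;
```

The loop matters for nested shells such as `<p><b> </b></p>`: the `<b>` goes in the first pass, and only then does the `<p>` around it count as empty.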

To get something that you can extend, don't build one major pile of obfuscated code that happens to work; split the data you need into pieces you can work with.

The code below uses two classes that do exactly what you need, and gives you a nice way to go through all the elements once you need them. It uses PHP Simple HTML DOM Parser instead of DOMDocument, because I like it a little better.

<?php
error_reporting(E_ALL);
require_once("simple_html_dom.php");

$html = <<<XML
<body>
        <div id="div1">
                <p>
                        <font>
                                <b>[[delimiter]]Start of content section 1.</b>
                        </font>
                </p>
                <p>
                        <span>More content in section 1</span>
                </p>
        </div>
        <div id="div2">
                <p>
                        <b>
                                <font>[[delimiter]]Start of section 2</font>
                        </b>
                </p>
                <span>More content in section 2</span>
                <p>
                        <font>[[delimiter]]Start of section 3</font>
                </p>
        </div>
        <div id="div3">
                <span>
                        <font>More content in section 3</font>
                </span>
        </div>
</body>
XML;



/*
 * CALL
 */

$parser = new HtmlParser($html, '[[delimiter]]');

//dump found
//decode/encode to only show public values
print_r(json_decode(json_encode($parser)));


/*
 * ACTUAL CODE
 */


class HtmlParser
{
    private $_html;
    private $_delimiter;
    private $_dom;

    public $Elements = array();

    final public function __construct($html, $delimiter)
    {
        $this->_html = $html;
        $this->_delimiter = $delimiter;
        $this->_dom = str_get_html($this->_html);

        $this->getElements();
    }

    private function getElements()
    {
        //this will find all elements, including parent elements
        //it will also select the actual text as an element, without surrounding tags
        $elements = $this->_dom->find("[contains(text(),'".$this->_delimiter."')]");

        //find the actual elements that start with the delimiter
        foreach($elements as $element) {
            //we want the element without tags, so we search for outertext
            if (strpos($element->outertext, $this->_delimiter)===0) {
                $this->Elements[] = new DelimiterTag($element);
            }
        }

    }

}

class DelimiterTag
{
    private $_element;

    public $Content;
    public $MoreContent;

    final public function __construct($element)
    {
        $this->_element = $element;
        $this->Content = $element->outertext;


        $this->findMore();
    }

    private function findMore()
    {
        //we need to traverse up until we find a parent that has a next sibling
        //we need to keep track of the child, to cleanup the last parent
        $child = $this->_element;
        $parent = $child->parent();
        $next = null;
        while($parent) {
            $next = $parent->next_sibling();

            if ($next) {
                break;
            }
            $child = $parent;
            $parent = $child->parent();
        }

        if (!$next) {
            //no more content
            return;
        }

        //create empty element, to build the new data
        //go up one more element and clean the innertext
        $more = $parent->parent();
        $more->innertext = "";

        //add the parent, because this is where the actual content lies
        //but we only want to add the child to the parent, in case there are more delimiters
        $parent->innertext = $child->outertext;
        $more->innertext .= $parent->outertext;

        //add the next sibling, because this is where more content lies
        $more->innertext .= $next->outertext;

        //set the variables
        if ($more->tag=="body") {
            //Your section 3 works slightly different as it doesn't show the parent tag, where the first two do.
            //That's why i show the innertext for the root tag and the outer text for others.
            $this->MoreContent = $more->innertext;
        } else {
            $this->MoreContent = $more->outertext;
        }

    }
}




?>

Cleaned up output:

stdClass Object
(
  [Elements] => Array
  (
    [0] => stdClass Object
    (
        [Content] => [[delimiter]]Start of content section 1.
        [MoreContent] => <div id="div1">
                            <p><font><b>[[delimiter]]Start of content section 1.</b></font></p>
                            <p><span>More content in section 1</span></p>
                          </div>
    )

    [1] => stdClass Object
    (
        [Content] => [[delimiter]]Start of section 2
        [MoreContent] => <div id="div2">
                            <p><b><font>[[delimiter]]Start of section 2</font></b></p>
                            <span>More content in section 2</span>
                         </div>
    )

    [2] => stdClass Object
    (
        [Content] => [[delimiter]]Start of section 3
        [MoreContent] => <div id="div2">
                            <p><font>[[delimiter]]Start of section 3</font></p>
                         </div>
                         <div id="div3">
                            <span><font>More content in section 3</font></span>
                          </div>
    )
  )
)

Not sure how we are supposed to deal with extra content. For example, if there are extra tags between delimiters, are they supposed to be part of the content? — Nigel Ren, Aug 30 '17 at 19:54
@NigelRen In my experience with parsing OpenOffice/Excel files, dumb users are pretty smart at messing things up. The chances of an actual parsing system that works in fewer than 100 lines of code with hardly any exceptions are near zero. That's why I built it with classes and split all the data into separate classes, so I can extend each part more easily. Because, like you said, there will be a lot of extra tags, especially in `Microsoft`-generated `HTML`. — Hugo Delsing, Aug 31 '17 at 07:32
I've come to the decision that it's an interesting theoretical exercise, but a practical nightmare. You would need probably hundreds of examples to ensure that any solution worked and then as you say someone comes along with example 101 which breaks the code again. — Nigel Ren, Aug 31 '17 at 08:30
I guess that sounds about right. But it does depend on what the documents contain. If you try to parse resumes, you will get a thousand different versions and it's hard. If you send out a form as a Word document that people need to fill in, you might get a very high success rate and could save a lot of time. But as with all parsing of user input, it's filled with exceptions. Not to mention the problems between different versions of Word. — Hugo Delsing, Aug 31 '17 at 08:50
TBH - without OP's input, it's difficult to even validate the basic assumptions that have been made about this document. Even the basics of the 'parent having next sibling' logic of tracking back up the document may be oversimplifying the possible combinations. — Nigel Ren, Aug 31 '17 at 20:31

Nigel Ren · Answer 2 · 2017-08-23T07:20:34.543

The nearest I've got so far is...

$html = <<<XML
<body>
    <div id="div1">
        <p>
            <font>
                <b>[[delimiter]]Start of content section 1.</b>
            </font>
        </p>
        <p>
            <span>More content in section 1</span>
        </p>
    </div>
    <div id="div2">
        <p>
            <b>
                <font>[[delimiter]]Start of section 2</font>
            </b>
        </p>
        <span>More content in section 2</span>
        <p>
            <font>[[delimiter]]Start of section 3</font>
        </p>
    </div>
    <div id="div3">
        <span>
            <font>More content in section 3</font>
        </span>
    </div>
</body>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");

foreach ($div as $child) {
    echo "Div=".$doc->saveHTML($child).PHP_EOL;
}

echo "Last bit...".$doc->saveHTML($child).PHP_EOL;
$div = $xp->query("following-sibling::*", $child);
foreach ($div as $remain) {
    echo $doc->saveHTML($remain).PHP_EOL;
}

I think I had to tweak the HTML to correct a (hopefully) erroneous missing </div>.

It would be interesting to see how robust this is, but difficult to test.

The 'last bit' attempts to take everything from the element containing the last marker (in this case div2) to the end of the document, using following-sibling::*.

Also note that it assumes that the body tag is the base of the document. So this will need to be adjusted to fit your document. It may be as simple as changing it to //body...
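One caveat with the query above: in XPath 1.0, `contains(text(), ...)` only looks at the first text node child of each element, so a delimiter that follows an inline tag within the same element is missed. A possible variation, sketched here on made-up markup (the `<i>` element and the shortened sections are assumptions for illustration), is to match the text nodes themselves and step up to the ancestor that is a direct child of `body`:

```php
<?php
// Made-up sample: in div2 the delimiter is NOT in the first text node of its <p>.
$html = '<html><body><div id="div1"><p><b>[[delimiter]]One</b></p></div>'
      . '<div id="div2"><p>intro <i>x</i>[[delimiter]]Two</p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Query in the style used above: tests only the first text() child of each descendant.
$original = $xp->query("//body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");

// Variation: match any text node containing the delimiter, then take the
// ancestor element whose parent is <body>.
$tops = $xp->query("//text()[contains(., '[[delimiter]]')]/ancestor::*[parent::body]");

$ids = [];
foreach ($tops as $top) {
    $ids[] = $top->getAttribute('id');
}

echo "original query matched: ", $original->length, PHP_EOL;
echo "variation matched: ", implode(', ', $ids), PHP_EOL;
```

With this sample, the original query should find only `div1`, while the variation finds both top-level elements. Because the result is a node set, a top-level element that contains several delimiters still appears only once.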

Update: a version with a bit more flexibility and the ability to cope with multiple sections in the same overall segment...

$html = <<<XML
    <html>
    <body>
        <div id="div1">
            <p>
                <font>
                    <b>[[delimiter]]Start of content section 1.</b>
                </font>
            </p>
            <p>
                <span>More content in section 1</span>
            </p>
        </div>
        <div id="div1a">
            <p>
                <span>More content in section 1</span>
            </p>
        </div>
        <div id="div2">
            <p>
                <b>
                    <font>[[delimiter]]Start of section 2</font>
                </b>
            </p>
            <span>More content in section 2</span>
            <p>
                <font>[[delimiter]]Start of section 3</font>
            </p>
        </div>
        <div id="div3">
            <span>
                <font>More content in section 3</font>
            </span>
        </div>
    </body>
    </html>
XML;

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("//body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");

$partCount = $div->length;
for ( $i = 0; $i < $partCount; $i++ )  {
    echo "Div $i...".$doc->saveHTML($div->item($i)).PHP_EOL;

    // Check for multiple sections in same element
    $count = $xp->evaluate("count(descendant::*[contains(text(),'[[delimiter]]')])",
            $div->item($i));
    if ( $count > 1 )   {
        echo PHP_EOL.PHP_EOL;
        for ($j = 0; $j< $count; $j++ ) {
            echo "Div $i.$j...".$doc->saveHTML($div->item($i)).PHP_EOL;
        }
    }
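    // Use a separate variable here: reassigning $div would overwrite the
    // list of delimiter-bearing elements that the outer loop still needs.
    $siblings = $xp->query("following-sibling::*", $div->item($i));
    foreach ($siblings as $remain) {
        // Stop at the next element that was identified as containing a delimiter
        if ( $i < $partCount-1 && $remain->isSameNode($div->item($i+1)) )   {
            break;
        }
        echo $doc->saveHTML($remain).PHP_EOL;
    }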

    echo PHP_EOL.PHP_EOL;
}

Thank you very much for your answer! I will give it a shot. I was unaware of the querying ability. — user8488500, Aug 22 '17 at 20:22
I've added some new code; it's still difficult to test, as you say the content is very dynamic. But this tries to bridge my earlier version with something that gives all the in-between content, using the `following-sibling` method until it reaches the next element that it has identified as having a section delimiter. — Nigel Ren, Aug 23 '17 at 07:22
