How to clean up a partial HTML source?

Question

As a search result I get the content around the search word. This is only a part of the whole page, so it includes just the tags that are near the search word. If the matching opening or closing tag is farther away, I am left with unbalanced HTML tags. These unbalanced tags can break the page layout, because the browser tries to balance them and uses tags from a completely different nesting level.

Example

This might be the whole page:

<li>
  <h3>Ang my oniuse.</h3> 
  <p>Oh! any or said faing ear Dand and tion on so wor st wouter and abox 
  a makess stand he he sne at mon the nany ing a me come hink floney a 
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat seelectler</h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
  thend that ance, he ned and me lood says wou hed set pidays far it
  conted, and seell yarty.</p>
</li>

Searching for seelectler might result in an HTML part like:

  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat <b>seelectler</b></h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his

Now the p and li tags are unbalanced. With the stray closing tags the browser closes the p tag that might surround the whole found text and the li tag that might surround each found entry.
The next opening tags of these types then have the wrong CSS classes, some div tags between li and p are left unmatched, and the closing tags at the end may close div tags of the column layout.

Result: the complete page layout is broken.

The desired result could be either this (all unpaired tags are paired; this cannot be foolproof):

<li><p>
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat <b>seelectler</b></h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
</p></li>

or:

  naiday. Smiler yousee lurneremiley boll his a grog.
  <h3>I'l hat <b>seelectler</b></h3> 
  Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his

but this solution might lose important layout, e.g. line breaks.

Is there a ViewHelper that can clean up unbalanced HTML tags, either by adding the missing parts or by removing the leftover parts?
Is there an algorithm or regular expression for detecting unbalanced tags?

I think it is an HTML parser issue in the indexed_search extension. It returns the result with HTML tags, so you have to format the result manually. Check this answer: https://stackoverflow.com/questions/43848632/indexed-search-extbase-htmltags-in-output?answertab=active#tab-top. Hope this helps! · Geee, Jun 23 '17 at 10:00

score 0 · Answer 1 · answered Jun 23 '17 at 09:44


I would suggest stripping all HTML tags from the search result and showing plain-text search results.

You might create some minor "formatting" by replacing certain tags with line breaks.
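A minimal sketch of this idea, assuming plain PHP outside of any ViewHelper. The function name `htmlFragmentToText` and the list of tags treated as line breaks are illustrative choices, not part of TYPO3 or indexed_search.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: converts an HTML fragment to plain text and keeps
 * line breaks where block-level tags used to be.
 */
function htmlFragmentToText(string $fragment): string
{
    // Replace <br> and a few block-level tags (assumed list) with newlines
    // before stripping, so adjacent blocks do not run together.
    $text = preg_replace(
        '~<br\s*/?>|</?(?:p|li|div|h[1-6]|tr)\b[^>]*>~i',
        "\n",
        $fragment
    ) ?? $fragment;

    // Remove all remaining tags, balanced or not.
    $text = strip_tags($text);

    // Decode entities; the result is plain text and must be escaped again
    // when it is written into a page.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of spaces/tabs, then runs of blank lines.
    $text = preg_replace('~[ \t]+~', ' ', $text) ?? $text;
    $text = preg_replace('~\s*\n\s*~', "\n", $text) ?? $text;

    return trim($text);
}

echo htmlFragmentToText(
    "grog.</p>\n</li>\n<li>\n  <h3>I'l hat <b>seelectler</b></h3>\n  <p> Imay e ney"
);
```

Because no tags survive, the fragment can no longer interfere with the surrounding page layout; the price is that inline markup such as the highlighted search word is lost unless it is added back afterwards.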


Wolffc


Bernd Wilke πφ · Answer 2 · 2017-06-26T13:08:40.287

The closest solution I have found is this ViewHelper:

<?php
namespace MyCompany\MyExtension\ViewHelpers;

use TYPO3\CMS\Fluid\Core\ViewHelper\AbstractViewHelper;

/**
 * fills in missing xml tags
 */
class BalanceXmlViewHelper extends AbstractViewHelper
{

    /**
     * balances XML-fragment with additional tags
     *
     * @param string $xmlIn
     * @return string
     */
    public function render($xmlIn = null)
    {
        if (null === $xmlIn) {
            $xmlIn = $this->renderChildren();
        }

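        $xmlDoc = new \DOMDocument();
        // Prefix an XML encoding declaration so libxml reads the fragment as UTF-8.
        $xmlDoc->loadHTML(
            '<?xml encoding="UTF-8">' . $xmlIn,
            // The fragment is not a complete HTML document, so suppress the implied
            // html/body wrapper and the doctype, and silence parser errors and warnings.
            LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NOXMLDECL
        );

        // Cut off the leading encoding declaration (23 characters) and decode
        // entities such as German umlauts back to UTF-8 characters.
        // Note: html_entity_decode() also decodes &lt;, &gt; and &amp;.
        $retVal = html_entity_decode(
            mb_substr($xmlDoc->saveHTML(), 23),
            ENT_COMPAT | ENT_HTML401
        );

        return $retVal;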
    }
}

I know the result can still contain invalid markup (e.g. li tags without a surrounding ul), but it is more precise than removing all tags (stripHTML()), which yields text without line breaks, or even without whitespace, where block tags were removed.
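For the detection part of the question, a tag stack is a simple alternative to a DOM parser. The following is only a sketch under stated assumptions: the function names `findUnbalancedTags` and `balanceFragment` are hypothetical, the regular expression is deliberately simplified, and added opening tags carry no attributes, so CSS classes of the original tags are not restored.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: lists tags in an HTML fragment that lack a partner.
 *
 * @return array{unopened: string[], unclosed: string[]}
 */
function findUnbalancedTags(string $fragment): array
{
    // Void elements never take a closing tag, so they are ignored.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'source', 'track', 'wbr'];
    $open = [];     // stack of tags opened inside the fragment
    $unopened = []; // closing tags whose opening tag lies before the fragment

    // Simplified tag pattern (assumption): it does not handle ">" inside
    // attribute values, comments, or script/style content.
    preg_match_all('~<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>~i', $fragment, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if (in_array($name, $void, true) || ($tag[3] ?? '') === '/') {
            continue; // void or self-closing tag
        }
        if ($tag[1] === '') {
            $open[] = $name; // opening tag
            continue;
        }
        // Closing tag: look for the innermost matching opening tag.
        $index = array_search($name, array_reverse($open, true), true);
        if ($index === false) {
            $unopened[] = $name;
        } else {
            // Drop the match and anything opened after it.
            $open = array_slice($open, 0, $index);
        }
    }

    return ['unopened' => $unopened, 'unclosed' => $open];
}

/**
 * Hypothetical helper: pairs every unpaired tag by adding bare tags
 * in front of and behind the fragment.
 */
function balanceFragment(string $fragment): string
{
    $result = findUnbalancedTags($fragment);
    $prefix = '';
    foreach (array_reverse($result['unopened']) as $name) {
        $prefix .= '<' . $name . '>';
    }
    $suffix = '';
    foreach (array_reverse($result['unclosed']) as $name) {
        $suffix .= '</' . $name . '>';
    }

    return $prefix . $fragment . $suffix;
}
```

Applied to the search result from the question, this reports p and li as closed but never opened, and li and p as opened but never closed, so `balanceFragment()` puts `<li><p>` in front of the fragment and `</p></li>` behind it. Like the DOM-based ViewHelper, it does not add a missing ul around the li tags.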
