
As a search result I get the content around the search word. But this is only a fragment of the whole page, and it includes only the tags that are near the search word. If the matching opening or closing tag is farther away, I am left with unbalanced HTML tags. These unbalanced tags can break the page layout: when the browser tries to balance them, it uses tags from a completely different level.

Example

This might be the whole page:

<li>
  <h3>Ang my oniuse.</h3> 
  <p>Oh! any or said faing ear Dand and tion on so wor st wouter and abox 
  a makess stand he he sne at mon the nany ing a me come hink floney a 
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat seelectler</h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
  thend that ance, he ned and me lood says wou hed set pidays far it
  conted, and seell yarty.</p>
</li>

Searching for seelectler might result in an HTML fragment like:

  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat <b>seelectler</b></h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his

Now the p tag and the li tags are unbalanced. With the stray closing tags the browser closes the p tag that might wrap the whole found text, and the li tag that might wrap each result entry.
The next opening tags of these types then have the wrong CSS classes, some div tags between li and p are now unmatched, and the closing tags at the end may close div tags that belong to the column layout.

Result: the complete page layout is broken.

The desired result could be either this (all unpaired tags are paired; this cannot be foolproof):

<li><p>
  naiday. Smiler yousee lurneremiley boll his a grog.</p>
</li>
<li>
  <h3>I'l hat <b>seelectler</b></h3> 
  <p> Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his
</p></li>    

or this (all unpaired tags are removed):

  naiday. Smiler yousee lurneremiley boll his a grog.
  <h3>I'l hat <b>seelectler</b></h3> 
  Imay e ney, agat nould a fiver, and and hishuch what gook, ley hires
  he cand and onius mon'l, handent a flit's and, th whey, hat wou used his

but this solution might lose important layout, e.g. line breaks.

Is there a viewhelper that can clean up unbalanced HTML tags, either by adding the missing counterparts or by removing the leftover tags?
Is there an algorithm or regexp for detecting unbalanced tags?
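
For illustration, detecting unbalanced tags is usually done with a stack rather than a single regexp. The following is only a minimal sketch under stated assumptions: findUnbalancedTags() is a hypothetical helper (not a TYPO3 or PHP built-in), the list of void elements is an assumed subset, and the tag regexp ignores comments, script content and attribute values that contain ">".

```php
<?php
// Hypothetical helper, not part of TYPO3: lists the unpaired tags of an HTML fragment.
function findUnbalancedTags(string $html): array
{
    // Void elements never get a closing tag, so they are skipped (assumed subset).
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'source', 'track', 'wbr'];
    $open = [];      // stack of tag names that are currently open
    $unclosed = [];  // opening tags whose closing tag is missing
    $unopened = [];  // closing tags whose opening tag is missing

    // Groups: 1 = "/" of a closing tag, 2 = tag name, 3 = "/" of a self-closing tag.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>~', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as [, $slash, $name, $selfClosing]) {
        $name = strtolower($name);
        if ($selfClosing === '/' || in_array($name, $void, true)) {
            continue;
        }
        if ($slash === '') {
            $open[] = $name; // opening tag: push onto the stack
            continue;
        }
        // Closing tag: find the nearest open tag with the same name.
        $positions = array_keys($open, $name, true);
        if ($positions === []) {
            $unopened[] = $name;
            continue;
        }
        $pos = end($positions);
        // Everything opened after the match was never closed.
        $unclosed = array_merge($unclosed, array_slice($open, $pos + 1));
        $open = array_slice($open, 0, $pos);
    }

    // Whatever is still on the stack has no closing tag in the fragment.
    return ['unopened' => $unopened, 'unclosed' => array_merge($unclosed, $open)];
}
```

For the fragment above this should report p and li as unopened and li and p as unclosed. Those two lists are what would have to be prepended/appended (or removed) to balance the fragment.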

Bernd Wilke πφ

2 Answers


I would suggest stripping all HTML tags from the search result and using plain-text search results.

You might create some minor "formatting" by replacing certain tags with line breaks.
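
A rough sketch of that idea in plain PHP could look like the following. htmlToPlainText() is a hypothetical helper name, and the list of tags that become line breaks is an assumption that would need adjusting to the actual markup.

```php
<?php
// Hypothetical helper: turns an HTML fragment into plain text but keeps line breaks.
function htmlToPlainText(string $html): string
{
    // Put a newline wherever a block-level tag (or br) opens or closes.
    // The tag list is an assumption, extend it as needed.
    $withBreaks = preg_replace('~<\s*/?\s*(?:p|div|li|ul|ol|h[1-6]|br|tr)\b[^>]*>~i', "\n", $html);

    // Remove all remaining (inline) tags, then decode entities such as &auml;.
    $text = html_entity_decode(strip_tags($withBreaks), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Trim every line and drop the empty ones.
    $lines = array_filter(
        array_map('trim', explode("\n", $text)),
        static fn (string $line): bool => $line !== ''
    );

    return implode("\n", $lines);
}
```

Because unbalanced tags are simply dropped, the fragment cannot break the surrounding layout. The output is plain text, so it still has to be escaped and the newlines converted (e.g. with nl2br()) before it is put back into the page.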

Wolffc

The closest solution I have found is this viewhelper:

<?php
namespace MyCompany\MyExtension\ViewHelpers;

use TYPO3\CMS\Fluid\Core\ViewHelper\AbstractViewHelper;

/**
 * fills in missing xml tags
 */
class BalanceXmlViewHelper extends AbstractViewHelper
{

    /**
     * balances XML-fragment with additional tags
     *
     * @param string $xmlIn
     * @return string
     */
    public function render($xmlIn = null)
    {
        if (null === $xmlIn) {
            $xmlIn = $this->renderChildren();
        }

        $xmlDoc = new \DOMDocument();
        // The XML encoding prefix tells libxml that the fragment is UTF-8 data.
        // NOIMPLIED / NODEFDTD: we want no complete HTML document, so neglect
        // the default html/body tags and the doctype.
        $xmlDoc->loadHTML(
            '<?xml encoding="UTF-8">' . $xmlIn,
            LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NOXMLDECL
        );

        // Cut off the 23-character encoding prefix again and decode the
        // entities saveHTML() produces (e.g. for German umlauts).
        $retVal = html_entity_decode(
            mb_substr($xmlDoc->saveHTML(), 23),
            ENT_COMPAT | ENT_HTML401
        );

        return $retVal;
    }
}

I know the result can still contain invalid markup (e.g. li tags without a ul), but it is more precise than removing all tags (stripHTML()), which results in text without line breaks or even whitespace where block tags were removed.

Bernd Wilke πφ