Index only plain text from HTML in solr

Question

I need to index only the plain text from HTML and discard all HTML tags.

For example, I have HTML like this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>
       title
    </title>
    <link href="./test.html" rel="StyleSheet" type="text/css" />
    </head>
    <body>
      <h1 style="height: 22px">
       header
      </h1>
    </body>
</html>

I want to index only the 'header' text under the body tag and keep all HTML tags out of the _text_ field in Solr.

I tried <charFilter class="solr.HTMLStripCharFilterFactory"/> like below:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

But it is still indexing the HTML tag attributes.

According to the Solr documentation for solr.HTMLStripCharFilterFactory, it should not index the HTML tags.

When I search solr/testcore/select?q=_text_:height&wt=json, it returns a record, which it should not.

I tried this in both solr-5.3.1 and solr-6.6.0.

I am stuck on this; please help me out.

So how are you indexing this file? And have you confirmed the type of the `_text_` field? – MatsLindh, Oct 30 '18 at 08:08
I am using the command /var/www/html/solr-6.0.0/bin/post -p 9000 -c testcore -filetypes htm,html /var/www/html/test/testcore/test.htm and – Vishnu Sharma, Oct 30 '18 at 09:02

MatsLindh · Accepted Answer · 2018-10-30T11:09:29.597

Since you're posting the HTML raw to Solr, it's being handled by the extracting request handler ("Solr Cell") - which uses Apache Tika to extract content from the HTML file.

That means that the _text_ field never sees the HTML tags at all, since the content has already been extracted by Apache Tika and the HTML tags have disappeared - so there's nothing to remove.

If you use a Solr client in a programming language of your choice and submit the HTML as a field value directly, the HTML stripping will take place as you expect (since the tags are then actually part of the content submitted to the field types internally in Solr).
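
As a rough illustration of submitting the HTML as a plain field value, here is a minimal Python sketch that posts a JSON document to the core's regular /update handler instead of sending the file through bin/post. The port and core name are taken from the command in the comments; the host, the document id, and the field name `html_body` are assumptions for illustration. The field is assumed to exist in the schema with the text_general type shown in the question.

```python
import json
import urllib.request

# Assumed location: host is a guess; port 9000 and core "testcore"
# come from the bin/post command quoted in the comments.
SOLR_UPDATE_URL = "http://localhost:9000/solr/testcore/update?commit=true"


def build_update_request(doc_id, html, field="html_body", url=SOLR_UPDATE_URL):
    """Build a POST request that sends the raw HTML as an ordinary field value.

    `html_body` is a hypothetical field name; it must be defined in the
    schema with a field type that includes HTMLStripCharFilterFactory.
    """
    payload = json.dumps([{"id": doc_id, field: html}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_html(doc_id, html):
    """Send the document to Solr and return the HTTP status code."""
    with urllib.request.urlopen(build_update_request(doc_id, html)) as resp:
        return resp.status
```

Calling `index_html("test.htm", open("test.htm", encoding="utf-8").read())` would then send the markup itself to Solr, so the tags reach the analysis chain and the char filter has something to strip.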

I tried to find some way of configuring the HTML parser in the bundled Tika version (it uses the TagSoup library for parsing), but I can't see any exposed configuration that would change what you're experiencing.

edited Oct 30 '18 at 11:09

answered Oct 30 '18 at 09:48

Thanks for your response. One observation: when I check the same input against the '_text_' field on the 'Analysis' screen of the Solr admin panel, it gives the correct output, with all the HTML tags filtered out. So is there something I am missing? – Vishnu Sharma Oct 30 '18 at 13:56
You're missing what I wrote. When you submit the document the way you're submitting it, the text extraction happens in _Tika_. When it arrives in your analysis chain, the content has already been extracted from the HTML, and there are no HTML tags present. The analysis chain has nothing to remove. To counter this you can submit the HTML content as a regular field in a Solr update, instead of submitting the file to the extracting request handler. – MatsLindh Oct 30 '18 at 13:58
