I need to index only the plain text from HTML documents and discard all HTML tags.
For example, I have HTML like this:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
title
</title>
<link href="./test.html" rel="StyleSheet" type="text/css" />
</head>
<body>
<h1 style="height: 22px">
header
</h1>
</body>
</html>
I want to index only the 'header' text under the body tag and keep all HTML tags and their attributes out of the _text_
field in Solr.
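To make the expected result concrete, here is a minimal Python sketch of tag stripping. It is not Solr code and does not reproduce how HTMLStripCharFilterFactory works internally; the names TextOnly, strip_html and SAMPLE are made up for illustration. It only shows what I mean by "plain text": the text nodes survive, while tag names and attribute values such as "height: 22px" do not.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Only character data between tags arrives here; attributes never do.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_html(markup):
    """Return the text content of `markup` joined by single spaces."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# The sample document from this question.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
title
</title>
<link href="./test.html" rel="StyleSheet" type="text/css" />
</head>
<body>
<h1 style="height: 22px">
header
</h1>
</body>
</html>"""

print(strip_html(SAMPLE))  # prints: title header
```

Note that plain tag stripping keeps the text inside the title element as well, so restricting the result to the body would need an extra step. The important part for this question is that "height" does not appear anywhere in the stripped output.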
I tried <charFilter class="solr.HTMLStripCharFilterFactory"/>,
as shown below:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
But it is still indexing the attributes of the HTML tags.
According to the Solr documentation, solr.HTMLStripCharFilterFactory should strip the HTML tags so that they are not indexed.
When I search with solr/testcore/select?q=_text_:height&wt=json,
it returns a record, which it should not.
I tried this in both solr-5.3.1 and solr-6.6.0.
I am stuck on this; please help me out.