I am trying to remove unwanted words, apply stemming, and finally create shingles. However, after the stop words are removed, I get shingles with "_" in place of the stop words. I tried using PatternReplaceFilterFactory to replace the _, but it is not working. My field type is defined as follows:
    <fieldType name="common_shingle" class="solr.TextField">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
      </analyzer>
    </fieldType>
When I analyse "A brown fox quickly jumps over the lazy dog", I get the following result:
- _ brown fox
- brown fox quickli
- fox quickli jump
- quickli jump _
- jump _ _
- _ _ lazi
- _ lazi dog
How do I remove the _ from the shingle tokens? Also, is there a way to create shingles only from stop words?
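
One possible direction, offered as a sketch rather than a verified fix: the "_" is a filler token that ShingleFilterFactory itself inserts wherever StopFilterFactory left a position gap, so a PatternReplaceFilterFactory placed before the shingle filter never sees it. Assuming the Solr version in use supports the fillerToken attribute on ShingleFilterFactory, the filler could be set to an empty string and the leftover whitespace cleaned up afterwards. The field type name common_shingle_nofiller is hypothetical; the rest mirrors the configuration above.

```xml
<!-- Sketch only: assumes ShingleFilterFactory accepts fillerToken in this Solr version -->
<fieldType name="common_shingle_nofiller" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- Use an empty filler instead of the default "_" where stop words were removed -->
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3" fillerToken=""/>
    <!-- Collapse runs of whitespace left behind by the empty filler -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <!-- Strip leading and trailing whitespace from each shingle -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

With this arrangement, a shingle that spanned a removed stop word would likely come out shorter than three words, so it hides the underscore but does not make shingles skip over the gaps. It is worth checking the output on the Analysis screen before relying on it.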