Let's say I have the following field type:
<fieldType name="text_body" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
My goal is to index, for each token, both the original token and the token produced after it has passed through all the token filters. For example, for the text:
"My dog is barking #DOGS"
The current field type (shown above) will index the following tokens:
"my", "dog", "bark", "dogs", "#dogs"
"is" will be dropped by the stop word filter, and "barking" will become "bark" because of the stemming filter.
I would like the following tokens to be indexed instead:
"My", "my", "dog", "barking", "bark", "dogs", "#DOGS".
I considered the "preserveOriginal" parameter of WordDelimiterGraphFilterFactory, but it only works for tokens with special characters, and the "original token" it keeps still passes through the other filters that come after it.
I know that the obvious way is to write a custom TokenFilter that indexes the tokens in their original form right after the tokenizer, but my question is whether there is something built into Solr that allows this.
I'm using Solr 6.5.1.
Thanks :)