Indexing original tokens in Solr

Question

Let's say I have a field type as the following:

<fieldType name="text_body" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My goal is to index, for each token, the original token as well as the token after passing all the token filters. For example, for the text:

"My dog is barking #DOGS"

The current field type (as mentioned above) will index the following tokens:

"my", "dog", "bark", "dogs", "#dogs"

"is" will be dropped because of the stopWords filter, and "barking" will become "bark" because of the stemming filter.

I would like the following tokens to be indexed:

"My", "my", "dog", "barking", "bark", "dogs", "#DOGS".

I considered the "preserveOriginal" parameter of WordDelimiterGraphFilterFactory, but it only works for tokens with special characters, and the "original token" still passes through the other filters after that.

I know the obvious way is to write a custom TokenFilter that indexes the tokens in their original form right after the tokenizer, but my question is whether there is something built into Solr that allows it.

I'm using Solr 6.5.1

Thanks :)

What's the use case for keeping the original tokens in the same field? Why not a dedicated field that contains the original tokens without filters applied? — MatsLindh, Apr 26 '20 at 11:25
@MatsLindh Of course your suggestion is possible and could fit; I'm still considering all the options. I just want to know whether it's even possible to keep it in the same field (I mean using something built into Solr, without writing additional plugins) before I consider it as an option. Thanks :) — Barry, Apr 27 '20 at 12:10
In that case I'd suggest doing that. Using separate fields for the same content but with different processing, instead of intermingling differently processed tokens, is one of the core tenets of Lucene and Solr. — MatsLindh, Apr 27 '20 at 12:49

score 1 · Answer 1 · answered May 01 '20 at 18:38

Nice question about maintaining search relevancy for natural language; the following will probably help.

If the only fields you search on use the fieldType mentioned above, i.e. "text_body", and you want both stemmed and original tokens to be searchable for every field in your list of fields to search on,

then try creating an additional field (say field_withoutStemmer) with another fieldType like "text_body" but without the following filter:

<filter class="solr.PorterStemFilterFactory"/>
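As a rough sketch of that suggestion, the schema could look something like the following. The names text_body_unstemmed, body and body_unstemmed are made up for illustration (they are not from the question), and the copyField simply feeds the same source text into both fields so each gets its own analysis chain:

```xml
<!-- Hypothetical second fieldType: same chain as "text_body", minus the stemmer -->
<fieldType name="text_body_unstemmed" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- solr.PorterStemFilterFactory intentionally left out -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field names: "body" is stemmed, "body_unstemmed" is not -->
<field name="body" type="text_body" indexed="true" stored="true"/>
<field name="body_unstemmed" type="text_body_unstemmed" indexed="true" stored="false"/>

<!-- Index the same text into both fields -->
<copyField source="body" dest="body_unstemmed"/>
```

Note that this still lowercases and drops stop words in the second field; if you need the tokens exactly as the tokenizer produced them ("My", "#DOGS"), you would presumably remove those filters from the second fieldType as well.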

In addition, if you are using the dismax/edismax query parser, you may want to set the "tie" parameter to a non-zero value (probably tie=1.0).

The parser scores a match as the highest field score plus tie times the sum of the other field scores. Setting tie=1.0 therefore makes the document's score the sum of the scores from both fields; with the default tie=0.0 it is a pure disjunction max, and only the higher of the two field scores counts.
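For example, the parser settings could be put in the request handler defaults in solrconfig.xml, roughly like this. This is only a sketch: it assumes two hypothetical fields, body (the stemmed "text_body" type) and body_unstemmed (the same type without the stemmer), and the same parameters can just as well be passed on the request instead:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Hypothetical field names: search the stemmed and unstemmed fields together -->
    <str name="qf">body body_unstemmed</str>
    <!-- tie=1.0: add the scores of both fields instead of keeping only the best one -->
    <float name="tie">1.0</float>
  </lst>
</requestHandler>
```

With this in place, a document that matches the exact (unstemmed) form in both fields should score higher than one that matches only through stemming, which is usually what you want from keeping the original tokens.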
