
I want to do an exact match on a field that is stemmed. For example, my data contains the value "Babysitters at work", and the field type is defined as follows:

<fieldType name="string_ci_stem" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>           
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.SnowballPorterFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.SnowballPorterFilterFactory"/>
        </analyzer>
</fieldType>

The value gets indexed as "babysitters at work" instead of "babysit at work". I have seen that Solr stems only the last word of the sentence when KeywordTokenizer is used.

Is there a way to index "Babysitters at work" as "babysit at work", such that:

"babysit at work" - returns a result
"babysit work" - does not return a result

Are there any other schema.xml definitions that would help achieve these results?

Any help will be appreciated.

Edit: Updated the question.

  • Hard to understand your problem, but it looks like it's related to stopwords and has nothing to do with stemming. – nomoa Sep 08 '14 at 10:44
  • @nomoa - I just edited the question; I had typed it wrongly. Sorry for the confusion. It's actually not related to stopwords, as I am not using them. – Kanishka Jain Sep 08 '14 at 10:57
  • OK, IIRC KeywordTokenizer emits one token with the whole input, so "Babysitters at work" will be indexed as a single token "babysitters at work". You should use StandardTokenizer, which splits on whitespace and punctuation. See: https://cwiki.apache.org/confluence/display/solr/Tokenizers – nomoa Sep 08 '14 at 12:21

1 Answer


KeywordTokenizerFactory is not designed for your use case: it indexes the whole input as a single token, without splitting the text into tokens such as "Babysitters", "at", "work". You'll get that tokenization with solr.StandardTokenizerFactory instead of solr.KeywordTokenizerFactory. More info here: https://cwiki.apache.org/confluence/display/solr/Tokenizers
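As a minimal sketch of that change, the field type from the question could be rewritten as below. The name text_ci_stem is made up for illustration, and language="English" is spelled out on the assumption that the English Snowball stemmer is what is wanted. A single analyzer element without a type attribute is applied at both index and query time.

```xml
<!-- Sketch only: "text_ci_stem" is an illustrative name, not taken from the question. -->
<fieldType name="text_ci_stem" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <!-- No type attribute: the same chain is used for indexing and querying. -->
    <analyzer>
        <!-- Splits the input into one token per word instead of one token for the whole value. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- The stemmer now sees each word separately. -->
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
</fieldType>
```

With this chain every word is stemmed on its own, but the field now holds several tokens, so a quoted query behaves as a phrase match rather than a whole-value match.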

Then, if you want to run a single-term query, you'll have to concatenate the emitted tokens back into one. I don't know if this kind of filter is available in Solr, but it should be fairly easy to create your own based on this thread: http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html

  1. Babysitters at work -> StandardTokenizer + LowerCaseFilter -> "babysitters" "at" "work"
  2. "babysitters" "at" "work" -> stemming -> "babysit" "at" "work"
  3. "babysit" "at" "work" -> Your Concatenate Filter -> "babysit at work"
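
Wired into schema.xml, the three steps above might look like the sketch below. It assumes a custom filter factory that you write yourself and place on Solr's classpath; com.example.ConcatenateFilterFactory and the field type name are hypothetical placeholders, not classes or names that ship with Solr.

```xml
<!-- Sketch only: com.example.ConcatenateFilterFactory is a hypothetical custom class
     that you would have to implement and deploy yourself. -->
<fieldType name="string_ci_stem_concat" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
        <!-- Step 1: split the value into words and lowercase them. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Step 2: stem each word individually. -->
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <!-- Step 3: join the stemmed words back into a single token (hypothetical filter). -->
        <filter class="com.example.ConcatenateFilterFactory"/>
    </analyzer>
</fieldType>
```

Because the field then holds exactly one token per value, a match would require the whole stemmed string to be equal. The query text presumably has to reach the analyzer as a whole, for example as a quoted phrase, so that it is concatenated the same way.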
nomoa
  • How can I achieve an exact keyword match using StandardTokenizerFactory? I believe it will be a phrase match instead. E.g. if I search for "babysitt at work", then indexed data such as "babysitt at work in Boston" should not be returned. StandardTokenizer is returning such data. – Kanishka Jain Sep 08 '14 at 12:38
  • Well I suppose you can't do that. Best match is a Phrase search. With StandardTokenizer you could implement a final Concatenate Filter which will concatenate all the tokens into one at the end of the analysis. http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html – nomoa Sep 08 '14 at 12:42
  • Thanks nomoa :) Will try this out. – Kanishka Jain Sep 08 '14 at 13:28