I am writing an autocomplete feature in Solr. Ideally, autocomplete would:
1. display suggestions if the target occurs in any of the words, but prefer an exact match over a KeywordTokenizerFactory edge n-gram match, and a KeywordTokenizerFactory edge n-gram match over a StandardTokenizer (or UAX29URLEmailTokenizerFactory) edge n-gram match
2. serve the document along with the suggestion
3. show unique suggestions only
This is my attempt at autocompleting:
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
<field name="categoryAutocompleteExactEdge" type="autocomplete_exact_edge" indexed="true" stored="false"/>
<field name="categoryAutocompleteTermsEdge" type="autocomplete_terms_edge" indexed="true" stored="false"/>
<copyField source="category" dest="categoryAutocompleteExactEdge"/>
<copyField source="category" dest="categoryAutocompleteTermsEdge"/>
<fieldType name="autocomplete_exact_edge" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="autocomplete_terms_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
handler:
<requestHandler name="/suggest_category" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="defType">edismax</str>
    <str name="rows">5</str>
    <str name="fl">category</str>
    <str name="qf">category^30 categoryAutocompleteExactEdge^10 categoryAutocompleteTermsEdge</str>
  </lst>
</requestHandler>
I think the above handles the ordering of suggestions in accordance with the first requirement. It also lets me fetch the document data along with the suggestion by changing fl. The problem I have is duplicate suggestions.
If there are many documents with category:"GASTROENTEROLOGIST", they can fill all 5 returned rows, so category:"GASTRO APPOINTMENT" may never be served. If faceting is enabled and rows is set to 0, then the qf ordering is lost.
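For reference, the faceting variant looks roughly like the sketch below. The handler name /suggest_category_facet and the facet.limit and facet.mincount values are placeholders chosen for illustration, not my exact config.

```xml
<!-- Hypothetical faceting variant of the handler; name and facet limits are illustrative -->
<requestHandler name="/suggest_category_facet" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="defType">edismax</str>
    <str name="qf">category^30 categoryAutocompleteExactEdge^10 categoryAutocompleteTermsEdge</str>
    <!-- return no documents, only the facet counts -->
    <str name="rows">0</str>
    <str name="facet">true</str>
    <str name="facet.field">category</str>
    <!-- skip categories that do not occur in the matching documents -->
    <str name="facet.mincount">1</str>
    <str name="facet.limit">5</str>
  </lst>
</requestHandler>
```

Each category comes back once, but facet values are sorted by count (or by index order with facet.sort=index), not by relevance score, so the qf boosts no longer influence which suggestion comes first.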
I am looking for an all-in-one solution, but it appears to me that serving unique suggestions and displaying document data are mutually exclusive. For example, if I move the categories to a new core, then the suggestion duplication problem vanishes, because I can force uniqueness. But lookups against the new core can't display additional document info.
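To make the separate-core idea concrete, here is a minimal sketch of what the schema of such a categories-only core could look like. It assumes uniqueness is forced by making category the uniqueKey, and it reuses the field types defined earlier; the core itself is hypothetical.

```xml
<!-- Schema sketch for a hypothetical categories-only core -->
<field name="category" type="string" indexed="true" stored="true" required="true"/>
<field name="categoryAutocompleteExactEdge" type="autocomplete_exact_edge" indexed="true" stored="false"/>
<field name="categoryAutocompleteTermsEdge" type="autocomplete_terms_edge" indexed="true" stored="false"/>
<copyField source="category" dest="categoryAutocompleteExactEdge"/>
<copyField source="category" dest="categoryAutocompleteTermsEdge"/>
<!-- With category as the uniqueKey, adding the same category again overwrites the existing document -->
<uniqueKey>category</uniqueKey>
```

With this layout every category is indexed exactly once, so the edismax ordering is kept and duplicates cannot occur, but the core holds nothing except the category itself. Note that the uniqueKey comparison is on the raw string, so values differing only in case would still count as distinct.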
This is my first time building autocomplete functionality and I am not sure how to tackle it. It would be really helpful if someone experienced could explain the best strategies for handling autocompletion. Is creating a new core for every field with autosuggestion the way to go?