autocomplete with ngrams generates duplicates

Question

I am writing an autocomplete feature in Solr. Ideally, autocomplete would:

- display suggestions if the target occurs in any of the words, but prefer an exact match over a KeywordTokenizerFactory edge n-gram match, and a KeywordTokenizerFactory edge n-gram match over a StandardTokenizer (or UAX29URLEmailTokenizerFactory) edge n-gram match
- serve the document along with the suggestion
- show unique suggestions only

This is my attempt at autocompleting:

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>
  <field name="categoryAutocompleteExactEdge" type="autocomplete_exact_edge" indexed="true" stored="false"/>
  <field name="categoryAutocompleteTermsEdge" type="autocomplete_terms_edge" indexed="true" stored="false"/>
  <copyField source="category" dest="categoryAutocompleteExactEdge"/>
  <copyField source="category" dest="categoryAutocompleteTermsEdge"/>

  <fieldType name="autocomplete_exact_edge" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
    </analyzer>
  </fieldType>


  <fieldType name="autocomplete_terms_edge" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
    </analyzer>
  </fieldType>

handler:

 <requestHandler name="/suggest_category" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="wt">json</str>
      <str name="defType">edismax</str>
      <str name="rows">5</str>
      <str name="fl">category</str>
      <str name="qf">category^30 categoryAutocompleteExactEdge^10 categoryAutocompleteTermsEdge</str>
    </lst>
  </requestHandler>

I think the above orders the suggestions in accordance with the first requirement. It also lets me fetch the document data along with the suggestion by changing fl. The problem I have is duplicate suggestions.

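If there are many documents with category: "GASTROENTEROLOGIST", they can fill all 5 returned rows, so category: "GASTRO APPOINTMENT" may never be served. If I enable faceting and set rows to 0, I get unique values, but the qf ordering is lost, because facet values are sorted by count or by index order rather than by score.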

I am looking for an all-in-one solution, but it appears to me that serving unique suggestions and displaying document data are mutually exclusive. For example, if I move the categories to a new core, the duplication problem vanishes, because I can force uniqueness. But lookups against the new core can't display additional document info.
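
One option that may be worth testing against the two requirements above is Solr's collapsing query parser, which keeps a single document per distinct value of a field (by default the highest-scoring one). The following is a minimal, untested sketch. It assumes category is single-valued, and the handler name /suggest_category_unique is hypothetical; the qf boosts are copied from the handler above.

```xml
<!-- Hypothetical handler name; same ranking as /suggest_category. -->
<requestHandler name="/suggest_category_unique" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="defType">edismax</str>
    <str name="rows">5</str>
    <str name="fl">category</str>
    <str name="qf">category^30 categoryAutocompleteExactEdge^10 categoryAutocompleteTermsEdge</str>
  </lst>
  <!-- Appended to every request: keep one document per category value.
       Assumes category is a single-valued string field. -->
  <lst name="appends">
    <str name="fq">{!collapse field=category}</str>
  </lst>
</requestHandler>
```

If this behaves as expected, rows=5 would return up to five distinct categories ranked by the same qf boosts, and fl could still list other stored fields of the representative document. In SolrCloud, collapsing requires all documents that share a category to be located on the same shard.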

This is my first time building autocomplete functionality and I am not sure how to tackle it. It would be really helpful if someone experienced could explain the best strategies for handling autocompletion. Is creating a new core for every field with autosuggestion the way to go?

Could you try adding RemoveDuplicatesTokenFilterFactory to your field definition? https://solr.apache.org/guide/8_8/filter-descriptions.html#remove-duplicates-token-filter — Abhijit Bashetti, Feb 02 '23 at 11:51
Thanks for stopping by, @AbhijitBashetti. Hm, removing duplicate tokens in a stream only makes sure that the indexed terms for the document field in question do not contain duplicates. It could still leave me with thousands of documents with the category GASTROENTEROLOGIST before GASTRO APPOINTMENT showed up in the suggestions. — sanjihan, Feb 02 '23 at 12:31
Creating a new core for every field does not seem like an ideal solution. Could you add an example of what you get with your query and what you expect instead? — Seasers, Feb 08 '23 at 10:46
@Seasers. I also didn't like the idea of maintaining a separate core with unique values just for autocompletion. After many tests I found that the built-in BlendedInfixLookupFactory does the job the way I wanted. I don't have the schema with ngrams anymore, but basically you get a hit for every document that matches the category. Such a search is useful if you search on Product Name, because you won't have 100k products with the exact same name, maybe just a few. On the other hand, ngrams are useless for searching on category, because you do have 100k docs belonging to the same category. — sanjihan, Feb 08 '23 at 22:23
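
A suggester along the lines described in the last comment could be configured roughly as follows in solrconfig.xml. This is a sketch, not the commenter's actual configuration: the component name, the suggester name categorySuggester, the /suggest path, the text_general analyzer type, and the blenderType choice are all assumptions.

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <!-- Hypothetical suggester name. -->
    <str name="name">categorySuggester</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <!-- Reads suggestions from a stored field of the main index. -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">category</str>
    <!-- Assumed field type; any analyzed text type in the schema would do. -->
    <str name="suggestAnalyzerFieldType">text_general</str>
    <!-- Favors matches that occur earlier in the suggestion. -->
    <str name="blenderType">position_linear</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">categorySuggester</str>
    <str name="suggest.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

With a setup like this, the suggester index would be built once with /suggest?suggest.build=true and then queried with, for example, /suggest?suggest.q=gastro. The response lists suggested terms with a weight and an optional payload rather than full documents.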
