Azure Cognitive Search - How to prevent an EdgeNGram tokenizer from breaking words at a hyphen?

Question

Here is how I am creating the Azure Search index for Cosmos DB documents with the SearchRequest model (some fields have been excluded from the SearchRequest model for brevity).

Please suggest the changes needed in the implementation below to prevent the edgeNgramTokenFilterV2 token filter from breaking words at a hyphen.

public class SearchRequest
{
    [SimpleField(IsKey = true, IsFilterable = true)]
    public string id { get; set; }

    [SearchableField(SearchAnalyzerName = LexicalAnalyzerName.Values.StandardLucene, IndexAnalyzerName = "prefixEdgeAnalyzer")]
    public string EntityID { get; set; }

    public MetaData? MetaData { get; set; }
}

public class MetaData
{
    [SearchableField(AnalyzerName = LexicalAnalyzerName.Values.EnMicrosoft)]
    public string? CustomerName { get; set; }

    [SearchableField(SearchAnalyzerName = LexicalAnalyzerName.Values.StandardLucene, IndexAnalyzerName = "prefixEdgeAnalyzer")]
    public List<string>? OpportunityIDs { get; set; }

}




    public async Task<Response<SearchIndex>> CreateIndex(string indexName)
    {
        try
        {
            var nedgeTokenfilter = new EdgeNGramTokenFilter("edgeNgramTokenFilterV2");
            nedgeTokenfilter.MinGram = 3;
            nedgeTokenfilter.MaxGram = 20;
            nedgeTokenfilter.Side = EdgeNGramTokenFilterSide.Front;

            var prefixEdgeAnalyzer = new CustomAnalyzer("prefixEdgeAnalyzer", LexicalTokenizerName.Standard);
            prefixEdgeAnalyzer.TokenFilters.Add(TokenFilterName.Lowercase);
            prefixEdgeAnalyzer.TokenFilters.Add("edgeNgramTokenFilterV2");

            var suggester = new SearchSuggester("spellCheckSuggester", $"MetaData/{nameof(SearchRequest.MetaData.CustomerName)}"); //for spell check

            FieldBuilder fieldBuilder = new FieldBuilder();
            var searchFields = fieldBuilder.Build(typeof(SearchRequest));

            var definition = new SearchIndex(indexName, searchFields);

            definition.TokenFilters.Add(nedgeTokenfilter);
            definition.Analyzers.Add(prefixEdgeAnalyzer);
            definition.Suggesters.Add(suggester);

            var response = await _adminClient.CreateOrUpdateIndexAsync(definition).ConfigureAwait(false);


            return response;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, ex.Message);
            throw;
        }
    }

Using the Analyze API, I can see that the text "7-ETREW" is tokenized as etr, etre, etrew, whereas I need it to be tokenized as 7-e, 7-et, 7-etr, 7-etre, 7-etrew.

https://{myServicename}.search.windows.net/indexes/{MyIndexname}/analyze?api-version=2020-06-30
{
  "text": "7-etrew",
  "analyzer": "prefixEdgeAnalyzer"
}

score 0 · Answer 1 · answered Apr 05 '23 at 18:04

This is likely due to your use of the "LexicalTokenizerName.Standard" tokenizer, which splits text into tokens on various delimiters (such as '-'). The "7" token is then shorter than MinGram (3) and is dropped, which is why only etr, etre, etrew remain. If you only want to split on whitespace, you could use the Whitespace tokenizer, or if you don't want the text to be split at all, you could try the Keyword tokenizer.
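
As a minimal sketch of that change (not a definitive implementation), the custom analyzer from the question can be rebuilt on LexicalTokenizerName.Whitespace while keeping the same lowercase and edge n-gram filters. The class name AnalyzerSketch and both method names are hypothetical, introduced only for illustration; the analyzer and filter names are the ones from the question. The second method assumes an existing SearchIndexClient and an index that already contains the analyzer, and calls the Analyze API from the SDK to list the resulting tokens.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Hypothetical helper class; the name is an assumption for this sketch.
public static class AnalyzerSketch
{
    // Same analyzer as in the question, but tokenizing on whitespace only,
    // so "7-ETREW" reaches the token filters as a single token.
    public static CustomAnalyzer BuildPrefixEdgeAnalyzer()
    {
        var analyzer = new CustomAnalyzer("prefixEdgeAnalyzer", LexicalTokenizerName.Whitespace);
        analyzer.TokenFilters.Add(TokenFilterName.Lowercase);
        // Refers to the EdgeNGramTokenFilter registered on the index in the question.
        analyzer.TokenFilters.Add("edgeNgramTokenFilterV2");
        return analyzer;
    }

    // Runs the index's analyzer over sample text and returns the emitted tokens.
    // Assumes the index has already been created or updated with the analyzer above.
    public static async Task<List<string>> AnalyzeAsync(
        SearchIndexClient adminClient, string indexName, string text)
    {
        var options = new AnalyzeTextOptions(text, "prefixEdgeAnalyzer");
        var response = await adminClient.AnalyzeTextAsync(indexName, options).ConfigureAwait(false);
        return response.Value.Select(t => t.Token).ToList();
    }
}
```

In CreateIndex, the result of BuildPrefixEdgeAnalyzer() would take the place of the existing prefixEdgeAnalyzer variable. Changing the analyzer on a field that already holds indexed data may require dropping and rebuilding the index. Note also that the fields in the question still use StandardLucene as the search analyzer, which splits query text at the hyphen; whether that matters depends on the queries being run.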

ramero-MSFT

That worked! Thank you! Are there any caveats to using a whitespace tokenizer over the standard tokenizer? – PRACHI AGARWAL Apr 06 '23 at 01:57
I am facing another issue now. Direct search via 7-etr works, but fielded search is not working. I am searching via MetaData/OpportunityIDs:*7-etr* – PRACHI AGARWAL Apr 06 '23 at 02:48
There is no caveat; it's a matter of how you want your text to be broken down. If you want it to be broken down only on whitespace, then the Whitespace tokenizer is the right choice. Regarding your other issue, I would suggest you create a new post and include more information (index definition, full request, etc.) – ramero-MSFT Apr 06 '23 at 05:24