
I need FEMMES.COM to get tokenized as singular + plural forms of the base word FEMME.

Custom Analyzer Config

"analyzers": [ { "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer", "name": "text_language_search_custom_analyzer", "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer", "tokenFilters": [ "lowercase", "asciifolding" ], "charFilters": [ "html_strip" ] } ], "tokenizers": [ { "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer", "name": "text_language_search_custom_analyzer_ms_tokenizer", "maxTokenLength": 300, "isSearchTokenizer": false, "language": "english" } ], "tokenFilters": [], "charFilters": []}

Analyze API call for FEMMES

{ "analyzer": "text_language_search_custom_analyzer", "text": "FEMMES" }

Analyze API response for FEMMES

{ "@odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 } ] }

Analyze API response for FEMMES.COM

{ "@odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 } ] }

Analyze API response for FEMMES COM

{ "@odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 } ]}

2 Answers


I think I figured this one out myself after some experimentation. I found that the MappingCharFilter could be used to replace . with , before the indexer did the tokenization. This allowed the lemmatization/stemming to work as expected on the terms in question. I still need to run more thorough integration tests against our other use cases, but I think this should solve the problem for anybody facing the same type of issue.
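For anyone wanting the concrete config, here is a minimal sketch of how the char filter might be added to the index definition above. The filter name map_dot_to_comma is my own placeholder, not from the original setup:

"charFilters": [
  {
    "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
    "name": "map_dot_to_comma",
    "mappings": [ ".=>," ]
  }
]

The custom analyzer would then reference it alongside the existing filter, e.g. "charFilters": [ "html_strip", "map_dot_to_comma" ].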

  • FYI, I had another edge case where this data... http://www.sarenza.com/selection-chaussures-grandes-occasions-femmes?abv=1 would not properly tokenize femmes into femme and femmes. I ultimately solved it by including several more characters in my MappingCharFilter and replacing all of them with a comma: #$€£%&+/=<>@_µ§¤°!?"'`~^.:;()[]{}*- The only one I could not replace was the backslash \. The Azure Search service would fail to create the index when I included the backslash in a char mapping, so I left that one out. This solves the edge case for me (see the sketch below). – Andres Becerra Apr 29 '17 at 14:08
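Following that comment, the expanded filter might look roughly like this. This is an untested sketch with a placeholder name: the double quote must be JSON-escaped, and the backslash is left out because it reportedly broke index creation:

{
  "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
  "name": "map_punctuation_to_comma",
  "mappings": [
    "#=>,", "$=>,", "€=>,", "£=>,", "%=>,", "&=>,", "+=>,",
    "/=>,", "@=>,", "!=>,", "?=>,", "\"=>,", "'=>,", "`=>,",
    "~=>,", "^=>,", ".=>,", ":=>,", ";=>,", "(=>,", ")=>,",
    "[=>,", "]=>,", "{=>,", "}=>,", "*=>,", "-=>,"
  ]
}

The remaining characters from the comment's list (=, <, >, _, µ, §, ¤, °) are omitted here since =, <, and > collide with the a=>b mapping syntax and may need escaping; treat this as a starting point rather than a verified list.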

My previous answer was not correct. The Azure Search implementation actually applies the language tokenizer BEFORE token filters. This essentially made the WordDelimiter token filter useless in my use case.

What I ended up having to do was pre-process the data BEFORE uploading it to Azure for indexing. In my C# code, I added some regex logic to break apart text like FEMMES2017 into FEMMES 2017 before sending it to Azure. That way, when the text reached Azure, the indexer saw FEMMES on its own and properly tokenized it as FEMME and FEMMES using the language tokenizer.
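A minimal sketch of that pre-processing step, assuming a helper of my own naming (the original code isn't shown, so the exact boundary rules here are an assumption):

using System.Text.RegularExpressions;

public static class SearchTextPreprocessor
{
    // Matches every letter<->digit boundary so a space can be inserted
    // there, turning "FEMMES2017" into "FEMMES 2017" before upload.
    private static readonly Regex LetterDigitBoundary =
        new Regex(@"(?<=\p{L})(?=\d)|(?<=\d)(?=\p{L})", RegexOptions.Compiled);

    public static string SplitLetterDigitRuns(string text) =>
        LetterDigitBoundary.Replace(text, " ");
}

// Example: SearchTextPreprocessor.SplitLetterDigitRuns("FEMMES2017")
// returns "FEMMES 2017", which the language tokenizer can then stem
// into FEMME and FEMMES as expected.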