
This issue is a new situation I am facing after applying a fix for FEMMES.COM not properly tokenizing (see How do I get French text FEMMES.COM to index as language variants of FEMMES).

Failing Test Case: #FEMMES2017 should tokenize to Femmes, Femme, 2017.

It is quite possible that my approach of using a MappingCharFilter was not correct and was really just a band-aid. What is the correct approach to get this failing test case to pass?

Current Index Configuration

  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "text_synonym_token_filter",
        "asciifolding",
        "language_word_delim_token_filter"
      ],
      "charFilters": [
        "html_strip",
        "replace_punctuation_with_comma"
      ]
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_exact_search_Index_custom_analyzer",
      "tokenizer": "text_exact_search_Index_custom_analyzer_tokenizer",
      "tokenFilters": [
        "lowercase",
        "asciifolding"
      ],
      "charFilters": []
    }
  ],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "name": "text_language_search_custom_analyzer_ms_tokenizer",
      "maxTokenLength": 300,
      "isSearchTokenizer": false,
      "language": "french"
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.StandardTokenizerV2",
      "name": "text_exact_search_Index_custom_analyzer_tokenizer",
      "maxTokenLength": 300
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "name": "text_synonym_token_filter",
      "synonyms": [
        "ca => ça",
        "yeux => oeil",
        "oeufs,oeuf,Œuf,Œufs,œuf,œufs",
        "etre,ete"
      ],
      "ignoreCase": true,
      "expand": true
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
      "name": "language_word_delim_token_filter",
      "generateWordParts": true,
      "generateNumberParts": true,
      "catenateWords": false,
      "catenateNumbers": false,
      "catenateAll": false,
      "splitOnCaseChange": true,
      "preserveOriginal": false,
      "splitOnNumerics": true,
      "stemEnglishPossessive": true,
      "protectedWords": []
    }
  ],
  "charFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "name": "replace_punctuation_with_comma",
      "mappings": [
        "#=>,",
        "$=>,",
        "€=>,",
        "£=>,",
        "%=>,",
        "&=>,",
        "+=>,",
        "/=>,",
        "==>,",
        "<=>,",
        ">=>,",
        "@=>,",
        "_=>,",
        "µ=>,",
        "§=>,",
        "¤=>,",
        "°=>,",
        "!=>,",
        "?=>,",
        "\"=>,",
        "'=>,",
        "`=>,",
        "~=>,",
        "^=>,",
        ".=>,",
        ":=>,",
        ";=>,",
        "(=>,",
        ")=>,",
        "[=>,",
        "]=>,",
        "{=>,",
        "}=>,",
        "*=>,",
        "-=>,"
      ]
    }
  ]

Analyze API Call

{
  "analyzer": "text_language_search_custom_analyzer",
  "text": "#femmes2017"
}
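
For reference, this body is POSTed to the index's Analyze endpoint. A sketch of the full request, using the service host and API version visible in the response below (the index name and api-key are placeholders):

  POST https://one-adscope-search-eu-prod.search.windows.net/indexes/[index name]/analyze?api-version=2016-09-01
  Content-Type: application/json
  api-key: [admin api-key]

  {
    "analyzer": "text_language_search_custom_analyzer",
    "text": "#femmes2017"
  }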

Analyze API Response

{
  "@odata.context": "https://one-adscope-search-eu-prod.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    {
      "token": "femmes",
      "startOffset": 1,
      "endOffset": 7,
      "position": 0
    },
    {
      "token": "2017",
      "startOffset": 7,
      "endOffset": 11,
      "position": 1
    }
  ]
}
1 Answer


The input text is processed by the components of the analyzer in order: char filters -> tokenizer -> token filters. In your case, the tokenizer performs lemmatization before tokens are processed by the WordDelimiter token filter. Unfortunately, the Microsoft stemmers and lemmatizers are not available as standalone token filters that you could apply after the WordDelimiter token filter, so you would need to add another token filter that normalizes the output of the WordDelimiter token filter according to your requirements.

If it's only this one case you care about, you could move the SynonymTokenFilter to the end of the analyzer chain and map femmes to femme. This is obviously not a great workaround, as it's very specific to the data you're processing, but hopefully this information will help you find a more generic solution.
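
As an illustration of that workaround, a sketch of the same configuration with the synonym filter moved to the end of the chain (the femmes rule is an assumption added for this specific test case; everything else is taken from the existing index definition):

  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "asciifolding",
        "language_word_delim_token_filter",
        "text_synonym_token_filter"
      ],
      "charFilters": [
        "html_strip",
        "replace_punctuation_with_comma"
      ]
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "name": "text_synonym_token_filter",
      "synonyms": [
        "femmes,femme",
        "ca => ça",
        "yeux => oeil",
        "oeufs,oeuf,Œuf,Œufs,œuf,œufs",
        "etre,ete"
      ],
      "ignoreCase": true,
      "expand": true
    }
  ]

With expand set to true, the equivalence rule "femmes,femme" emits both tokens at the same position, which is what the failing test case expects; an explicit mapping ("femmes => femme") would instead replace femmes with femme.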

Yahnoosh
  • That is the advantage the site we are replacing has over us at this point. Their SOLR config allowed this chain. – Andres Becerra May 02 '17 at 03:50
  • You can always use the Lucene Stemmer token filter after the WordDelimiter token filter but remember it will stem all tokens produced by the analyzer. – Yahnoosh May 02 '17 at 03:54
  • Do you mean StemmerTokenFilter on this page? https://learn.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search The description is "Language specific stemming filter". So that would only perform stemming, and no lemmatization? I guess there is no HunspellStemFilterFactory equivalent that I could just feed this .dic and .aff file the old site has? – Andres Becerra May 02 '17 at 04:01
  • Yes, that site has the list of all supported token filters in Azure Search. The Stemmer token filter performs simple, language-specific stemming, no lemmatization. Unfortunately, Hunspell stemmer token filter is not supported at this moment. – Yahnoosh May 02 '17 at 04:03
  • We ended up getting around this by pre-processing text before sending it to Azure. So now our Azure index has 3 fields for every searchable field. For example, we have a title field (retrievable = true), a titleLangSearch field (uses a custom analyzer), and a titleExactSearch field (a different custom analyzer). We pre-process the data going into the titleLangSearch field (removing punctuation, handling word breaks, etc.) and then upload it to the Azure index. It would be most useful if Azure supported language logic AFTER all the token filters. – Andres Becerra May 11 '17 at 16:48
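
Following up on the Stemmer token filter suggestion in the comments, a sketch of what that could look like, assuming a filter name of your choosing and the built-in "french" stemmer (as noted above, it stems every token the analyzer produces, so femmes would be replaced by femme rather than both being kept):

  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
      "name": "french_stemmer_token_filter",
      "language": "french"
    }
  ]

The new filter would then be appended to the analyzer's tokenFilters list, after "language_word_delim_token_filter".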