
I have a business requirement for a French website: a search must match the masculine, feminine, singular, and plural forms of a word. The easiest way to describe this is to show the requirement itself through examples.

Req 1 - search for chien (masculine/singular)

The following words should be included in the search results:

  • chien (masculine/singular)
  • chiens (masculine/plural)
  • chienne (feminine/singular)
  • chiennes (feminine/plural)

When I researched this requirement, I used the Analyze API with the "fr.microsoft" analyzer to quickly test the various scenarios.
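For anyone who wants to reproduce these calls, here is a minimal Python sketch of how such a request could be built. The service name, index name, API key, and API version are placeholders chosen for illustration; only the request body mirrors what I actually sent. The helper that reads the response assumes the tokens come back in a "tokens" array with a "token" field per entry.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text,
                          analyzer="fr.microsoft",
                          api_version="2020-06-30"):
    """Build (but do not send) a POST request for the Analyze API.

    service, index, api_key and api_version are placeholders.
    """
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


def extract_tokens(response_json):
    """Return the token strings from a parsed Analyze API response."""
    return [entry["token"] for entry in response_json.get("tokens", [])]
```

Sending the request with `urllib.request.urlopen(...)` and passing the parsed JSON to `extract_tokens` should give token lists like the ones shown in the responses below.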

Request #1

{ "analyzer": "fr.microsoft", "text": "chien" }

Response #1

  • chien

Request #2

{ "analyzer": "fr.microsoft", "text": "chiens" }

Response #2

  • chien
  • chiens

Request #3

{ "analyzer": "fr.microsoft", "text": "chienne" }

Response #3

  • chien
  • chienner
  • chienne

Request #4

{ "analyzer": "fr.microsoft", "text": "chiennes" }

Response #4

  • chien
  • chienner
  • chiennes

Req 2 - search for lecteur (masculine/singular)

The following words should be included in the search results:

  • lecteur (masculine/singular)
  • lecteurs (masculine/plural)
  • lectrice (feminine/singular)
  • lectrices (feminine/plural)

I again used the Analyze API with the "fr.microsoft" analyzer to quickly test the various scenarios.

Request #1

{ "analyzer": "fr.microsoft", "text": "lecteur" }

Response #1

  • lecteur

Request #2

{ "analyzer": "fr.microsoft", "text": "lecteurs" }

Response #2

  • lecteur
  • lecteurs

Request #3

{ "analyzer": "fr.microsoft", "text": "lectrice" }

Response #3

  • lecteur
  • lectrice

Request #4

{ "analyzer": "fr.microsoft", "text": "lectrices" }

Response #4

  • lecteur
  • lectrices

My Impressions and Questions

  • My initial impression is that searching "chiennes" would not match a document containing "chienne" because "chiennes" is only broken down to the following: chien, chienner, chiennes.

  • Is that impression correct? Or will searching "chiennes" still return a document containing "chienne"? The search term "chiennes" gets tokenized to chien, chienner, chiennes, while the document's "chienne" would be tokenized to chien, chienner, chienne. The two share the tokens chien and chienner, so there would ultimately be a match.

  • Note that the 2 example requirements above may actually end up being a duplicate of my femme vs femmes S.O. question I posted earlier today: Azure Search: Searching for singular version of a word, but still include plural version in results
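To make the reasoning in the second bullet concrete, here is a small Python sketch. It assumes the same analyzer runs at both index time and query time, and that a document matches when it shares at least one token with the query; that is my simplified mental model, not a confirmed description of how the service scores results. The token sets are copied from the Analyze API responses above.

```python
# Token sets copied from the Analyze API responses for Req 1.
ANALYZED = {
    "chien": {"chien"},
    "chiens": {"chien", "chiens"},
    "chienne": {"chien", "chienner", "chienne"},
    "chiennes": {"chien", "chienner", "chiennes"},
}


def shared_tokens(query_word, document_word, analyzed=ANALYZED):
    """Return the tokens that the query and the document have in common."""
    return analyzed[query_word] & analyzed[document_word]


print(sorted(shared_tokens("chiennes", "chienne")))  # ['chien', 'chienner']
```

Under that assumption every pair in Req 1 overlaps on at least the token chien.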


Req 3 - search for MELEE

The following words should be included in the search results:

  • MELEE
  • MÊLEE
  • Mêlée
  • mêlant
  • melee
  • mêlé
  • mELer

Request #1

{ "analyzer": "fr.microsoft", "text": "MELEE" }

Response #1

  • melee

Request #2

{ "analyzer": "fr.microsoft", "text": "MÊLEE" }

Response #2

  • melee
  • mêlee

Request #3

{ "analyzer": "fr.microsoft", "text": "Mêlée" }

Response #3

  • meler
  • mêler
  • mele
  • mêle
  • melee
  • mêlee

Request #4

{ "analyzer": "fr.microsoft", "text": "mêlant" }

Response #4

  • meler
  • mêler
  • melant
  • mêlant

In this example, I could continue with more Analyze API calls, but here I can also compare the existing website (whose functionality we need to reproduce) against the new one. The existing website lets me search for "melee" and finds documents containing "mêlant" (screenshot of existing website).

But based on the results from Analyze API, I can see that searching "melee" would not find "mêlant" because "melee" only gets tokenized to "melee" while "mêlant" only gets tokenized to meler, mêler, melant and mêlant. There is no match here.
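The same comparison can be written out for all four words I analyzed. This is a rough Python sketch under the assumption that a document matches whenever it shares at least one token with the query; the token sets are copied from the responses above.

```python
# Token sets copied from the Analyze API responses for the MELEE example.
ANALYZED = {
    "MELEE": {"melee"},
    "MÊLEE": {"melee", "mêlee"},
    "Mêlée": {"meler", "mêler", "mele", "mêle", "melee", "mêlee"},
    "mêlant": {"meler", "mêler", "melant", "mêlant"},
}


def matching_words(query_word, analyzed=ANALYZED):
    """List the analyzed words sharing at least one token with the query."""
    query_tokens = analyzed[query_word]
    return [word for word, tokens in analyzed.items() if query_tokens & tokens]


print(matching_words("MELEE"))  # ['MELEE', 'MÊLEE', 'Mêlée']
print(matching_words("Mêlée"))  # ['MELEE', 'MÊLEE', 'Mêlée', 'mêlant']
```

So with these token sets, a search for "Mêlée" would reach "mêlant" through the shared tokens meler and mêler, while a search for "MELEE" would not.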

My Impressions and Questions

  • I used Google Translate and can see that "melee" means "scrimmage" or "brawl".
  • I used Google Translate and can see that "mêlant" means "mixing".
  • Is this why a search for "melee" would not match "mêlant"?
  • What are my options if the business demands they match? Would I have to use synonyms? If not, what are my options here?
  • Please note that the existing website uses SOLR and we are not given access to any of the existing code or how SOLR is used. We have had to reverse engineer everything.
  • I did manage to get my hands on the SOLR configuration, and it looks like this is how their current SOLR configuration is set up for the French language. It looks like they use a dictionary of some sort.


Please advise.

Uwe Keim

1 Answer


I think I answered the first and second requirements in your other post: Azure Search: Searching for singular version of a word, but still include plural version in results. Let me know if I missed something.

With regard to the third requirement, I suspect the website you're referring to is using an aggressive stemming strategy, meaning that both melee and mêlant are reduced to the same root. On top of stemming, they might be using fuzzy search or other query expansion methods, such as synonym expansion. The question is whether you want documents containing mêlant to match the word melee, given that they mean different things.

Both fuzzy search and synonym expansion are possible in Azure Search. You can also experiment with custom analyzers to take control over how stemming is done. We use Lucene components that are the same as the ones used in SOLR so you should be able to replicate the same analyzer configuration in most cases.
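As a rough illustration of these options, the request bodies might look something like the Python sketch below. The names fr_folding_stem, fr_stemmer, and fr-synonyms are made up, and the choice and order of token filters is an assumption to experiment with, not a recommended configuration. Whether a given stemmer actually reduces melee and mêlant to the same root is something to verify with the Analyze API.

```python
import json


def french_custom_analyzer(name="fr_folding_stem"):
    """Index-definition fragment for a hypothetical custom French analyzer."""
    return {
        "analyzers": [{
            "name": name,
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "tokenizer": "standard_v2",
            # lowercase and asciifolding are predefined filters;
            # fr_stemmer is defined just below.
            "tokenFilters": ["lowercase", "asciifolding", "fr_stemmer"],
        }],
        "tokenFilters": [{
            "name": "fr_stemmer",
            "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
            "language": "french",
        }],
    }


def synonym_map(name, groups):
    """Synonym map body: one line of comma-separated equivalents per group."""
    return {
        "name": name,
        "format": "solr",
        "synonyms": "\n".join(", ".join(group) for group in groups),
    }


def fuzzy_query(term, distance=1):
    """Query parameters for a fuzzy term search (full Lucene syntax)."""
    return {"search": f"{term}~{distance}", "queryType": "full"}


print(json.dumps(synonym_map("fr-synonyms", [["melee", "mêlant"]]),
                 ensure_ascii=False, indent=2))
```

A synonym map only takes effect once a searchable field lists it in its synonymMaps property, and a custom analyzer has to be assigned to a field in the index definition before it is used.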

Hope that helps.

Yahnoosh