I have a business requirement for a French website: a search must match the masculine/feminine and singular/plural forms of a word. The easiest way to describe this is to show the requirement itself.
Req 1 - search for chien (masculine/singular)
The following words should be included in the search results:
- chien (masculine/singular)
- chiens (masculine/plural)
- chienne (feminine/singular)
- chiennes (feminine/plural)
When I researched this requirement, I used the Analyze API with the "fr.microsoft" analyzer to quickly test the various scenarios.
Request #1
{ "analyzer": "fr.microsoft", "text": "chien" }
Response #1
- chien
Request #2
{ "analyzer": "fr.microsoft", "text": "chiens" }
Response #2
- chien
- chiens
Request #3
{ "analyzer": "fr.microsoft", "text": "chienne" }
Response #3
- chien
- chienner
- chienne
Request #4
{ "analyzer": "fr.microsoft", "text": "chiennes" }
Response #4
- chien
- chienner
- chiennes
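For anyone who wants to reproduce these calls, here is a minimal Python sketch of an Analyze API request. The service name, index name, API key, and api-version below are placeholders I made up for illustration (none of them come from my actual setup), and the response shape (`tokens[].token`) is what I see in the REST responses.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your own values.
SERVICE = "my-search-service"
INDEX = "my-index"
API_VERSION = "2020-06-30"  # assumed; use the version your service supports


def build_analyze_request(text, analyzer="fr.microsoft", api_key="<admin-key>"):
    """Build (but do not send) a POST request for the Analyze API."""
    url = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
           f"/analyze?api-version={API_VERSION}")
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_tokens(response_json):
    """Return the token strings from an Analyze response body."""
    return [t["token"] for t in response_json.get("tokens", [])]


def analyze(text, **kwargs):
    """Send the request and return the tokens (needs network access and a real key)."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        return extract_tokens(json.load(resp))
```

Calling `analyze("chiennes")` against a real service should then return the same token list as Response #4 above.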
Req 2 - search for lecteur (masculine/singular)
The following words should be included in the search results:
- lecteur (masculine/singular)
- lecteurs (masculine/plural)
- lectrice (feminine/singular)
- lectrices (feminine/plural)
I again used the Analyze API with the "fr.microsoft" analyzer to quickly test the various scenarios.
Request #1
{ "analyzer": "fr.microsoft", "text": "lecteur" }
Response #1
- lecteur
Request #2
{ "analyzer": "fr.microsoft", "text": "lecteurs" }
Response #2
- lecteur
- lecteurs
Request #3
{ "analyzer": "fr.microsoft", "text": "lectrice" }
Response #3
- lecteur
- lectrice
Request #4
{ "analyzer": "fr.microsoft", "text": "lectrices" }
Response #4
- lecteur
- lectrices
My Impressions and Questions
My initial impression is that searching "chiennes" would not match a document containing "chienne" because "chiennes" is only broken down to the following: chien, chienner, chiennes.
Is that impression correct? Or will searching "chiennes" still return a document containing "chienne", because the search term "chiennes" is tokenized to chien, chienner, chiennes while the document's "chienne" is tokenized to chien, chienner, chienne, so the two sides share tokens and ultimately match? The tokens I think would match on both sides are chien and chienner.
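To make the second possibility concrete, here is a small sketch of the overlap I have in mind. It only compares the token sets copied from the Analyze API responses above. It assumes the same analyzer runs at index time and at query time, and that the tokens produced for a single search term are OR'ed together, which is exactly the part I am unsure about.

```python
# Tokens copied from the Analyze API responses in this question.
QUERY_TOKENS = {"chien", "chienner", "chiennes"}   # search term "chiennes"
INDEXED_TOKENS = {"chien", "chienner", "chienne"}  # document text "chienne"


def shared_tokens(query_tokens, indexed_tokens):
    """Return the tokens that appear on both the query side and the index side."""
    return set(query_tokens) & set(indexed_tokens)


overlap = shared_tokens(QUERY_TOKENS, INDEXED_TOKENS)
print(sorted(overlap))  # ['chien', 'chienner']
```

If the engine matches on any shared token, this non-empty overlap is why I would expect a hit.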
Note that the two example requirements above may turn out to be a duplicate of the femme vs. femmes Stack Overflow question I posted earlier today: "Azure Search: Searching for singular version of a word, but still include plural version in results".
Req 3 - search for MELEE
The following words should be included in the search results:
- MELEE
- MÊLEE
- Mêlée
- mêlant
- melee
- mêlé
- mELer
Request #1
{ "analyzer": "fr.microsoft", "text": "MELEE" }
Response #1
- melee
Request #2
{ "analyzer": "fr.microsoft", "text": "MÊLEE" }
Response #2
- melee
- mêlee
Request #3
{ "analyzer": "fr.microsoft", "text": "Mêlée" }
Response #3
- meler
- mêler
- mele
- mêle
- melee
- mêlee
Request #4
{ "analyzer": "fr.microsoft", "text": "mêlant" }
Response #4
- meler
- mêler
- melant
- mêlant
In this example, I could continue with Analyze API calls, but instead I can compare the existing website (whose functionality we need to reproduce) with the new one. On the existing website, a search for "melee" finds documents containing "mêlant".
But based on the results from Analyze API, I can see that searching "melee" would not find "mêlant" because "melee" only gets tokenized to "melee" while "mêlant" only gets tokenized to meler, mêler, melant and mêlant. There is no match here.
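As a sanity check that accents are not the cause, here is a small sketch that strips diacritics from both token sets and compares them again. The `fold_accents` helper is my own illustration using Python's standard `unicodedata` module. It is not something the analyzer exposes.

```python
import unicodedata


def fold_accents(token):
    """Strip combining marks, e.g. 'mêlant' becomes 'melant'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Tokens copied from the Analyze API responses above.
melee_tokens = {"melee"}                                # search term "MELEE"
melant_tokens = {"meler", "mêler", "melant", "mêlant"}  # document text "mêlant"

folded_query = {fold_accents(t) for t in melee_tokens}
folded_index = {fold_accents(t) for t in melant_tokens}

print(sorted(folded_index))         # ['melant', 'meler']
print(folded_query & folded_index)  # set()
```

Even with accents removed, the two sides share no token, so the gap appears to come from the stems (melee vs. meler/melant) rather than from the diacritics.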
My Impressions and Questions
- I used Google Translate and can see that "melee" means "scrimmage" or "brawl".
- I used Google Translate and can see that "mêlant" means "mixing".
- Is this why a search for "melee" would not match "mêlant"?
- What are my options if the business demands they match? Would I have to use synonyms? If not, what are my options here?
- Please note that the existing website uses SOLR and we are not given access to any of the existing code or how SOLR is used. We have had to reverse engineer everything.
- I did manage to get my hands on the SOLR configuration, and it looks like this is how their current SOLR configuration is set up for the French language. It looks like they use a dictionary of some sort.
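On the synonyms option raised above, this is a rough sketch of the synonym map body I imagine I would have to send. The map name is hypothetical, and the word group is just the list from Req 3. I have not verified how a synonym map interacts with the "fr.microsoft" analyzer, so treat this as an assumption and not a working solution.

```python
import json


def build_synonym_map(name, groups):
    """Build a synonym map body using Solr-style equivalence rules, one group per line."""
    rules = "\n".join(", ".join(group) for group in groups)
    return {"name": name, "format": "solr", "synonyms": rules}


# Hypothetical map name; the words come from the Req 3 list in this question.
payload = build_synonym_map(
    "fr-melee-synonyms",
    [["melee", "mêlée", "mêlant", "mêlé", "mêler"]],
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The obvious drawback is that every word family would need its own hand-maintained rule, which is why I am asking whether there is a better option.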
Please advise.