ElasticSearch fuzzy incremental search strategy and index creation

Question

I have built an ElasticSearch (ES) index to meet a user's search need. The backend uses NestJS, but that is not important here. The search runs from a single input field: as you type, the list of results is updated.

The workflow is as follows: input field -> interpretation of the value -> construction of an ES query -> sending to ES -> returning results.

Interpreting the value:

Depending on what is entered, the interpretation can steer the search towards specific fields.

Examples:

"hello" -> could be interpreted as a usual firstname, usual lastname, birth firstname, birth lastname, or an email.
"hello@" -> at this stage, the presence of "@" eliminates any other search except for email.
"2000" -> will be interpreted as a phone number or a birth year.
"Laurent 58" -> could be interpreted as a usual firstname, usual lastname, birth firstname, birth lastname, phone number, age, or birth year (1958, not 2058).
etc.

Output for "Laurent 58":

[
    { data: 'Laurent', strict: false, type: [ 'Name' ] },
    {
      data: '58',
      strict: false,
      type: [ 'Age', 'Phone' ]
    },
    { data: '1958', strict: false, type: [ 'Year' ] }
]
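
The interpretation step could be sketched roughly as follows. This is a minimal sketch, not the actual implementation: the function name interpretInput, the SearchToken shape and the classification rules are assumptions reconstructed from the examples and the output above. In particular, a plain word is mapped to Name only (as in the output shown), and a one- or two-digit number is assumed to produce an extra 19xx year token.

```typescript
// Hypothetical token shape, modelled on the output shown above.
type TokenType = "Name" | "Email" | "Age" | "Phone" | "Year";

interface SearchToken {
  data: string;
  strict: boolean;
  type: TokenType[];
}

// Assumed rules, reconstructed from the examples in the question.
function interpretInput(input: string): SearchToken[] {
  const tokens: SearchToken[] = [];
  for (const part of input.trim().split(/\s+/)) {
    if (part === "") continue; // empty input
    if (part.indexOf("@") !== -1) {
      // An "@" rules out every interpretation except email.
      tokens.push({ data: part, strict: false, type: ["Email"] });
    } else if (/^\d+$/.test(part)) {
      if (part.length <= 2) {
        // Short number: age or start of a phone number...
        tokens.push({ data: part, strict: false, type: ["Age", "Phone"] });
        // ...or a birth year in the 1900s ("58" -> "1958", not "2058").
        const twoDigits = (part.length === 1 ? "0" : "") + part;
        tokens.push({ data: "19" + twoDigits, strict: false, type: ["Year"] });
      } else if (part.length === 4) {
        tokens.push({ data: part, strict: false, type: ["Phone", "Year"] });
      } else {
        tokens.push({ data: part, strict: false, type: ["Phone"] });
      }
    } else {
      tokens.push({ data: part, strict: false, type: ["Name"] });
    }
  }
  return tokens;
}
```

With these assumed rules, interpretInput("Laurent 58") yields the three tokens listed above.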

ES Query Construction

Following this analysis, a sub-query element is generated for each interpreted token, and these elements are inserted into a general ES query model. The firstname and lastname fields are grouped into a single field called fullname_concat using the copy_to property. The age field is computed at query time from the date of birth.
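
The age calculation can be reproduced outside ES to check boundary cases. This is a minimal sketch, assuming the same constant as the age runtime mapping used in the query (31556952000 ms, i.e. a mean year of 365.2425 days); ageAt is a hypothetical helper name.

```typescript
// Milliseconds in a mean Gregorian year (365.2425 days), same constant as the runtime script.
const MS_PER_YEAR = 31556952000;

// Mirrors the runtime script: 0 when there is no birthdate, otherwise whole years elapsed.
// Math.floor matches the script's integer division for birthdates in the past.
function ageAt(birthdate: Date | null, now: Date): number {
  if (birthdate === null) return 0;
  return Math.floor((now.getTime() - birthdate.getTime()) / MS_PER_YEAR);
}

// Someone born on 1 January 1965, evaluated on 17 April 2023.
const exampleAge = ageAt(new Date(Date.UTC(1965, 0, 1)), new Date(Date.UTC(2023, 3, 17)));
console.log(exampleAge); // 58
```

Because the constant is an average year length, the computed age can be off by a day or so around the birthday itself.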

Example from "Laurent 58":

{
    "query": {
        "dis_max": {
            "queries": [
                {
                    "match": {
                        "fullname_concat": {
                            "query": "Laurent",
                            "fuzziness": "AUTO"
                        }
                    }
                },
                {
                    "term": {
                        "age": {
                            "value": "58"
                        }
                    }
                },
                {
                    "prefix": {
                        "main_international_phone": {
                            "value": "58"
                        }
                    }
                },
                {
                    "prefix": {
                        "main_national_phone": {
                            "value": "58"
                        }
                    }
                },
                {
                    "match": {
                        "birthdate_year": {
                            "query": "1958"
                        }
                    }
                }
            ],
            "tie_breaker": 0.7
        }
    },
    "fields": [
        "age"
    ],
    "runtime_mappings": {
        "age": {
            "type": "long",
            "script": {
                "source": "if (doc['birthdate'].size() == 0) { emit(0) } else { emit((System.currentTimeMillis() - doc['birthdate'].value.getMillis())/31556952000L) }"
            }
        }
    },
    "explain": false,
    "from": 0,
    "size": 30
}
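
The construction of the dis_max part from the interpreted tokens could look roughly like this. It is a sketch under assumptions: the names clausesFor and buildDisMaxQuery are hypothetical, and the mapping from token type to clause is copied from the example query above. The Email type is left out because no email clause appears in that example.

```typescript
// Token types that have a clause in the example query (Email is omitted on purpose).
type TokenType = "Name" | "Age" | "Phone" | "Year";

interface SearchToken {
  data: string;
  strict: boolean;
  type: TokenType[];
}

type EsClause = Record<string, unknown>;

// One or more clauses per interpretation of a token.
function clausesFor(token: SearchToken): EsClause[] {
  const clauses: EsClause[] = [];
  for (const t of token.type) {
    switch (t) {
      case "Name":
        clauses.push({ match: { fullname_concat: { query: token.data, fuzziness: "AUTO" } } });
        break;
      case "Age":
        clauses.push({ term: { age: { value: token.data } } });
        break;
      case "Phone":
        clauses.push({ prefix: { main_international_phone: { value: token.data } } });
        clauses.push({ prefix: { main_national_phone: { value: token.data } } });
        break;
      case "Year":
        clauses.push({ match: { birthdate_year: { query: token.data } } });
        break;
    }
  }
  return clauses;
}

// Puts every clause of every token into a single dis_max, as in the example query.
function buildDisMaxQuery(tokens: SearchToken[]): EsClause {
  const queries: EsClause[] = [];
  for (const token of tokens) {
    for (const clause of clausesFor(token)) queries.push(clause);
  }
  return { dis_max: { queries: queries, tie_breaker: 0.7 } };
}
```

Since every clause ends up in the same dis_max, a document only has to match one of them to be returned.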

ElasticSearch Index

The query is executed on the following index. The data being searched is in French, so values include accents, hyphens ("-") and apostrophes ("'").

{
    "myindex": {
        "aliases": {
            "myalias": {
                "is_write_index": true
            }
        },
        "mappings": {
            "properties": {
                "birth_firstname": {
                    "type": "keyword",
                    "copy_to": [
                        "fullname_concat",
                        "firstname_concat"
                    ]
                },
                "birth_surname": {
                    "type": "keyword",
                    "copy_to": [
                        "fullname_concat",
                        "surname_concat"
                    ]
                },
                "usual_firstname": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword"
                        }
                    },
                    "copy_to": [
                        "fullname_concat",
                        "firstname_concat"
                    ]
                },
                "usual_surname": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword"
                        }
                    },
                    "copy_to": [
                        "fullname_concat",
                        "surname_concat"
                    ]
                },
                "birthdate": {
                    "type": "date",
                    "format": "dd/MM/yyyy"
                },
                "birthdate_day": {
                    "type": "integer"
                },
                "birthdate_month": {
                    "type": "integer"
                },
                "birthdate_year": {
                    "type": "integer"
                },
                "firstname_concat": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword"
                        }
                    },
                    "analyzer": "my_analyzer"
                },
                "surname_concat": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword"
                        }
                    },
                    "analyzer": "my_analyzer"
                },
                "fullname_concat": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword"
                        }
                    },
                    "analyzer": "my_analyzer"
                },
                "main_email": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword"
                        }
                    },
                    "analyzer": "my_analyzer"
                },
                "main_international_phone": {
                    "type": "text"
                },
                "main_national_phone": {
                    "type": "text"
                }               
            }
        },
        "settings": {
            "index": {
                "routing": {
                    "allocation": {
                        "include": {
                            "_tier_preference": "data_content"
                        }
                    }
                },
                "number_of_shards": "1",
                "provided_name": "myindex",
                "creation_date": "1681316810583",
                "analysis": {
                    "analyzer": {
                        "my_analyzer": {
                            "filter": [
                                "lowercase",
                                "asciifolding",
                                "trim"
                            ],
                            "type": "custom",
                            "tokenizer": "my_tokenizer"
                        }
                    },
                    "tokenizer": {
                        "my_tokenizer": {
                            "token_chars": [
                                "letter",
                                "digit",
                                "custom"
                            ],
                            "custom_token_chars": "'-",
                            "min_gram": "2",
                            "max_gram": "2",
                            "type": "edge_ngram"
                            
                        }
                    }
                },
                "number_of_replicas": "1",
                "uuid": "E7anyel9T7a5GNas7HMutA",
                "version": {
                    "created": "8060299"
                }
            }
        }
    }
}

Issues

I don't think that my index is set up correctly. For example, "laurent" and "Laurent" return different results, which should not be the case if I understand the use of "lowercase" in the filter. I also don't believe that my search strategy is effective, as the returned results are not consistent with the search query.

For example, "Laurent" on the fullname_concat field:

**Laurent** xxxxxxx / score : 4.5331917
Labri xxxxxxx / score : 4.5331917
Laayachi xxxxxxx / score : 4.5331917
Latifa xxxxxxx / score : 4.5331917
Latifa xxxxxxx / score : 4.5331917
Lahoucine xxxxxxx / score : 4.5331917
Larbi xxxxxxx / score : 4.5331917
Lakhdar xxxxxxx / score : 4.5331917
Laetitia xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
Laure xxxxxxx / score : 4.5331917
Lazhar xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
Laurence xxxxxxx / score : 4.5331917
Laetitia xxxxxxx / score : 4.5331917
Lahsen xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
Laurence xxxxxxx / score : 4.5331917
Laurence xxxxxxx / score : 4.5331917
Laurence xxxxxxx / score : 4.5331917
Laetitia xxxxxxx / score : 4.5331917
Laurie xxxxxxx / score : 4.5331917
Laurynn xxxxxxx / score : 4.5331917
Laetitia xxxxxxx / score : 4.5331917
Laurence xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
Laurence xxxxxxx / score : 4.5331917
Lahcene xxxxxxx / score : 4.5331917
Laure xxxxxxx / score : 4.5331917
Laura xxxxxxx / score : 4.5331917
Larbi xxxxxxx / score : 4.5331917
Lahcene xxxxxxx / score : 4.5331917
Laure xxxxxxx / score : 4.5331917
Latifa xxxxxxx / score : 4.5331917
Lahcen xxxxxxx / score : 4.5331917
Lahcene xxxxxxx / score : 4.5331917
Laetitia xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
Laurence xxxxxxx / score : 4.5331917
Laura xxxxxxx / score : 4.5331917
Laure xxxxxxx / score : 4.5331917
Lahcene xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
Laura xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917
**Laurent** xxxxxxx / score : 4.5331917

Edit: I obtain the same kind of results with this query:

{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "fullname_concat": {
                            "query": "laurent"
                        }
                    }
                }
            ]
        }
    },
    "explain": false,
    "from": 0,
    "size": "50"
}

The score is the same for all results, and I don't understand how it is computed. Moreover, the score changes a lot depending on how the query is generated, so I can't really use the min_score option.

For "djmpdosiq" (a random string):

Djemel xxxxxxx / score : 12.715759
Dje xxxxxxx / score : 11.359609
Naima xxxxxxx / score : 10.895781
Embarka xxxxxxx / score : 10.895781
Djida xxxxxxx / score : 10.895781
Anais xxxxxxx / score : 10.895781
Nadia xxxxxxx / score : 10.895781
Lyna xxxxxxx / score : 10.895781
Anasthasie xxxxxxx / score : 10.895781
Hassouna xxxxxxx / score : 10.895781
Adam xxxxxxx / score : 10.895781
Djamel xxxxxxx / score : 10.895781
Djeyendirane xxxxxxx / score : 10.895781
Djamilat xxxxxxx / score : 10.895781
Djeneba xxxxxxx / score : 10.895781
Djamal xxxxxxx / score : 10.895781
Sonia xxxxxxx / score : 10.895781
Djamel xxxxxxx / score : 10.895781
Djallil xxxxxxx / score : 10.895781
Djoti xxxxxxx / score : 10.895781
Imene xxxxxxx / score : 10.895781
Leila xxxxxxx / score : 10.895781
Fatiha xxxxxxx / score : 10.895781
Mohamed xxxxxxx / score : 10.895781
Zahia xxxxxxx / score : 10.895781
Djamila xxxxxxx / score : 10.895781
Corinne xxxxxxx / score : 10.266362
...

These results score higher than "Laurent" does for the query "Laurent", even though the query has the same structure in both cases.

Additionally, perhaps because of the "dis_max" query, the more text I add to the input field, the more results are returned, instead of the list being narrowed down and non-matching documents filtered out.
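
One possible way to make additional input narrow the results, offered as an assumption to test rather than a confirmed fix, is to keep the alternatives for a single typed word inside their own dis_max and require every word's group with bool.must. The helper name requireEveryWord is hypothetical; bool, must and dis_max are standard ES query clauses.

```typescript
type EsClause = Record<string, unknown>;

// groups: one array of alternative clauses per typed word.
// A document must match at least one alternative of every word.
function requireEveryWord(groups: EsClause[][], tieBreaker: number = 0.7): EsClause {
  const must: EsClause[] = [];
  for (const alternatives of groups) {
    if (alternatives.length === 0) continue; // a word with no interpretation adds no constraint
    must.push({ dis_max: { queries: alternatives, tie_breaker: tieBreaker } });
  }
  return { bool: { must: must } };
}

// "Laurent 58": the name group and the number group must both match.
const narrowedQuery = requireEveryWord([
  [{ match: { fullname_concat: { query: "Laurent" } } }],
  [
    { term: { age: { value: "58" } } },
    { prefix: { main_national_phone: { value: "58" } } },
    { match: { birthdate_year: { query: "1958" } } },
  ],
]);
console.log(JSON.stringify(narrowedQuery, null, 2));
```

Note that "58" and "1958" come from the same typed word, so they belong to the same group here.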

Is there someone with some experience in this area who can guide me on the strategy I should use for my search and let me know if there are any issues with my index?

Thanks a lot!

EDIT 2023/04/17:

I changed the n-gram max_gram setting to 20, and the results are better; I had misunderstood this parameter. It is still not perfect, though: some people with a different firstname, or a slight variation of it, come first. The search is done on fullname_concat.
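
The effect of the max_gram setting can be illustrated with a small simulation. This is a simplified stand-in for my_analyzer, not ES itself: it lowercases and splits on whitespace only, and leaves out asciifolding and the custom token characters; edgeNgrams is a hypothetical helper name.

```typescript
// Emits, for each word, the prefixes of length minGram..maxGram (edge n-grams).
// Words shorter than minGram produce nothing.
function edgeNgrams(text: string, minGram: number, maxGram: number): string[] {
  const grams: string[] = [];
  for (const word of text.toLowerCase().split(/\s+/)) {
    const longest = Math.min(maxGram, word.length);
    for (let n = minGram; n <= longest; n++) {
      grams.push(word.slice(0, n));
    }
  }
  return grams;
}

// With min_gram = max_gram = 2, very different names collapse to the same token.
console.log(edgeNgrams("Laurent", 2, 2)); // [ 'la' ]
console.log(edgeNgrams("Latifa", 2, 2)); // [ 'la' ]

// With max_gram = 20, "Pierre" produces five tokens...
console.log(edgeNgrams("Pierre", 2, 20)); // [ 'pi', 'pie', 'pier', 'pierr', 'pierre' ]
// ...and all five are also among the tokens of "Pierrette".
console.log(edgeNgrams("Pierrette", 2, 20));
```

Since the mapping sets no search_analyzer, a match query on fullname_concat analyzes the search text with the same analyzer, so the query "Pierre" is itself split into these prefixes. That is probably why names such as Pierrel or Pierrette still rank close to an exact "Pierre".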

For "Pierre", I got:

**** Pierrel
Pierre ****
Pierre ****
**** **** Pierredon
Pierre **** ****
**** Pierret
Pierre ****
Pierrette ****
**** Pierret
Pierre ****
Pierre ****
...

It seems that I can't use fuzziness and expect good results; the results are better without it. I also have issues when I search on multiple criteria. And I still don't understand how the score is computed: my query changes depending on the second step (input to tokens), so I can't filter out low-scoring results.

Is it a good strategy to generate multiple search tokens? For example, should "Pierre 58" be searched in a dis_max across name, age, year, international_phone and national_phone? Or will that always return a jumble of results? Thanks.

You have misunderstood how edge_ngram works. With your setting "min_gram": "2", "max_gram": "2", "djmpdosiq" is analyzed to "dj" and "Laurence" to "la", which does not make sense. Try increasing max_gram. - Mathew, Apr 13 '23 at 16:38
Use the _analyze API to test your tokens: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/test-analyzer.html - Mathew, Apr 13 '23 at 16:40
Hi @Mathew! Thanks, yes, I misunderstood the n-gram size setting. It's more consistent now that I have set max_gram to 20, but it's still not perfect. I'm going to edit my question. - Xav, Apr 17 '23 at 10:03
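
The _analyze check suggested in the comments could be prepared like this. This is a sketch that only builds the request; buildAnalyzeRequest and the AnalyzeRequest shape are hypothetical, and sending the request with an HTTP client or the ES client is left out. The index-level _analyze endpoint accepts a body with analyzer and text.

```typescript
interface AnalyzeRequest {
  method: "POST";
  path: string;
  body: { analyzer: string; text: string };
}

// Builds the request that asks ES which tokens an analyzer produces for a text.
function buildAnalyzeRequest(index: string, analyzer: string, text: string): AnalyzeRequest {
  return {
    method: "POST",
    path: "/" + encodeURIComponent(index) + "/_analyze",
    body: { analyzer: analyzer, text: text },
  };
}

// Request for the index and analyzer from the question.
const analyzeRequest = buildAnalyzeRequest("myindex", "my_analyzer", "Laurent");
console.log(analyzeRequest.path, JSON.stringify(analyzeRequest.body));
```

The response lists the tokens actually stored for the text, which makes it easy to compare different min_gram and max_gram settings.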
