How to use an ngram and edge ngram tokenizer together in elasticsearch index?

Question

I have an index containing 3 documents.

{
    "firstname": "Anne",
    "lastname": "Borg"
}

{
    "firstname": "Leanne",
    "lastname": "Ray"
}

{
    "firstname": "Anne",
    "middlename": "M",
    "lastname": "Stone"
}

When I search for "Anne", I would like Elasticsearch to return all 3 of these documents (because they all match the term "Anne" to a degree). However, I would like Leanne Ray to have a lower score (relevance ranking), because the search term "Anne" appears at a later position in that document than it does in the other two.

Initially, I was using an ngram tokenizer. My index's mapping also has a generated field called "full_name" that contains the firstname, middlename and lastname strings. When I search for "Anne", all 3 documents are in the result set. However, Anne M Stone has the same score as Leanne Ray, when Anne M Stone should score higher.

To address this, I changed my ngram tokenizer to an edge_ngram tokenizer. This had the effect of completely leaving out Leanne Ray from the result set. We would like to keep this result in the result set - because it still contains the query string - but with a lower score than the other two better matches.
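A rough way to see why Leanne Ray drops out is to simulate both tokenizers on a single word. The sketch below is plain Python, not Elasticsearch itself. The helper names `ngrams` and `edge_ngrams` are made up for illustration, and it assumes the input is one word that has already been split on characters outside `token_chars`.

```python
def ngrams(word, min_gram, max_gram):
    """Every substring of length min_gram..max_gram (what an ngram tokenizer emits for one word)."""
    word = word.lower()  # mirrors the lowercase filter in my_analyzer
    return {
        word[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(word) - size + 1)
    }


def edge_ngrams(word, min_gram, max_gram):
    """Only the prefixes of length min_gram..max_gram (what an edge_ngram tokenizer emits)."""
    word = word.lower()
    return {word[:size] for size in range(min_gram, min(max_gram, len(word)) + 1)}


# ngram, min_gram=3 / max_gram=4: the query "Anne" and "Leanne" share grams.
shared_ngram = ngrams("Anne", 3, 4) & ngrams("Leanne", 3, 4)
print(sorted(shared_ngram))

# edge_ngram, min_gram=2 / max_gram=10: the prefixes of "Leanne" and "Anne" never overlap.
shared_edge = edge_ngrams("Anne", 2, 10) & edge_ngrams("Leanne", 2, 10)
print(sorted(shared_edge))
```

With the ngram settings the two words share the grams "ann", "anne" and "nne", so Leanne Ray matches. With the edge_ngram settings the shared set is empty, which is consistent with Leanne Ray disappearing from the results.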

I read somewhere that it may be possible to use the edge ngram filter alongside an ngram filter in the same index. If so, how should I recreate my index to do so? Is there a better solution?

Here are the initial index settings.

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "token_chars": [
                        "letter",
                        "digit",
                        "custom"
                    ],
                    "custom_token_chars": "'-",
                    "min_gram": "3",
                    "type": "ngram",
                    "max_gram": "4"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "contact_id": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },

            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },


            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },

            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },

            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

And here is my query

{
    "query": {
        "bool": {
            "should": [
                {
                    "query_string": {
                        "query": "Anne",
                        "fields": [
                            "full_name"
                        ]
                    }
                }
            ]
        }
    }
}

This brought back these results

    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 0.59604377,
        "hits": [
            {
                "_index": "contacts_15",
                "_type": "_doc",
                "_id": "3",
                "_score": 0.59604377,
                "_source": {
                    "firstname": "Anne",
                    "lastname": "Borg"
                }
            },
            {
                "_index": "contacts_15",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.5592099,
                "_source": {
                    "firstname": "Anne",
                    "middlename": "M",
                    "lastname": "Stone"
                }
            },
            {
                "_index": "contacts_15",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.5592099,
                "_source": {
                    "firstname": "Leanne",
                    "lastname": "Ray"
                }
            }
        ]
    }

If I instead use an edge ngram tokenizer, this is what the index's settings look like...

{
    "settings": {
        "max_ngram_diff": "10",
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "filter": [
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "edge_ngram_tokenizer"
                }
            },
            "tokenizer": {
                "edge_ngram_tokenizer": {
                    "token_chars": [
                        "letter",
                        "digit",
                        "custom"
                    ],
                    "custom_token_chars": "'-",
                    "min_gram": "2",
                    "type": "edge_ngram",
                    "max_gram": "10"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "contact_id": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },

            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },


            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },

            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                },
                "copy_to": [
                    "full_name"
                ]
            },

            "full_name": {
                "type": "text",
                "analyzer": "my_analyzer",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

and that same query brings back this new result set...

   "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 1.5131824,
        "hits": [
            {
                "_index": "contacts_16",
                "_type": "_doc",
                "_id": "3",
                "_score": 1.5131824,
                "_source": {
                    "firstname": "Anne",
                    "middlename": "M",
                    "lastname": "Stone"
                }
            },
            {
                "_index": "contacts_16",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.4100108,
                "_source": {
                    "firstname": "Anne",
                    "lastname": "Borg"
                }
            }
        ]
    }

Val · Accepted Answer · 2020-05-13T05:59:07.397

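You can keep using the ngram tokenizer (i.e. your first solution), but you then need to change your query to improve the relevance. The idea is to keep your query_string query in a must clause and add a boosted multi_match query in a should clause, which increases the score of documents whose first or last name matches the input exactly. Since firstname and lastname are not analyzed with my_analyzer, only a whole-word match such as "Anne" gets the boost; "Leanne" does not.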

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "Anne",
            "fields": [
              "full_name"
            ]
          }
        }
      ],
      "should": [
        {
          "multi_match": {
            "query": "Anne",
            "fields": [
              "firstname",
              "lastname"
            ],
            "boost": 10
          }
        }
      ]
    }
  }
}

This query would bring Anne Borg and Anne M Stone before Leanne Ray.

UPDATE

Here is how I arrived at the results.

First I created a test index with the exact same settings/mappings as you have added to your question:

PUT test
{ ... copy/pasted mappings/settings ... }

Then I added the three sample documents you provided:

POST test/_doc/_bulk
{"index":{}}
{"firstname":"Anne","lastname":"Borg"}
{"index":{}}
{"firstname":"Leanne","lastname":"Ray"}
{"index":{}}
{"firstname":"Anne","middlename":"M","lastname":"Stone"}

Finally, if you run my query above, you get the following results, which is exactly what you expect (look at the scores):

{
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 5.1328206,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "4ZqbDHIBhYuDqANwQ-ih",
        "_score" : 5.1328206,
        "_source" : {
          "firstname" : "Anne",
          "lastname" : "Borg"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "45qbDHIBhYuDqANwQ-ih",
        "_score" : 5.0862665,
        "_source" : {
          "firstname" : "Anne",
          "middlename" : "M",
          "lastname" : "Stone"
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "4pqbDHIBhYuDqANwQ-ih",
        "_score" : 0.38623023,
        "_source" : {
          "firstname" : "Leanne",
          "lastname" : "Ray"
        }
      }
    ]
  }
}
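For the approach the question title asks about (ngram and edge_ngram in the same index), one possible sketch is to index full_name twice through a multi-field and boost matches on the edge_ngram copy. This has not been run against a cluster. The names `ngram_analyzer`, `edge_analyzer`, `full_name.edge` and `build_query` are hypothetical, and the Python below only builds the request bodies as dictionaries and prints them as JSON.

```python
import json

# Shared character classes, copied from the tokenizer settings in the question.
GRAM_CHARS = {"token_chars": ["letter", "digit", "custom"], "custom_token_chars": "'-"}

index_body = {
    "settings": {"analysis": {
        "tokenizer": {
            "ngram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 4, **GRAM_CHARS},
            "edge_ngram_tokenizer": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10, **GRAM_CHARS},
        },
        "analyzer": {
            "ngram_analyzer": {"type": "custom", "tokenizer": "ngram_tokenizer", "filter": ["lowercase"]},
            "edge_analyzer": {"type": "custom", "tokenizer": "edge_ngram_tokenizer", "filter": ["lowercase"]},
        },
    }},
    "mappings": {"properties": {
        "firstname": {"type": "text", "copy_to": ["full_name"]},
        "middlename": {"type": "text", "copy_to": ["full_name"]},
        "lastname": {"type": "text", "copy_to": ["full_name"]},
        "full_name": {
            "type": "text",
            "analyzer": "ngram_analyzer",  # matches anywhere inside a word
            # Hypothetical sub-field: same text, indexed as word prefixes only.
            "fields": {"edge": {"type": "text", "analyzer": "edge_analyzer"}},
        },
    }},
}


def build_query(term, boost=10):
    """must: ngram match keeps every hit; should: prefix match on the sub-field adds score."""
    return {"query": {"bool": {
        "must": [{"match": {"full_name": term}}],
        "should": [{"match": {"full_name.edge": {"query": term, "boost": boost}}}],
    }}}


print(json.dumps(index_body, indent=2))
print(json.dumps(build_query("Anne"), indent=2))
```

Under these assumptions, Leanne Ray would still be returned through the must clause, but only documents where some word starts with the search term would pick up the extra score from the should clause. Because edge n-grams are produced per word, a name such as "Mary Anne" would receive the boost as well.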

You should try again from scratch, because all I reported above was based on real tests with your mapping and sample data you provided, not suppositions ;-) Or update your question with what you're trying to do and we can see where that itches — Val, May 12 '20 at 19:10
Thanks Val. I made my question a bit more straightforward. Is this helpful? What other information would it be helpful to share? I can't think of anything else. — GNG, May 12 '20 at 22:24
I've updated my answer to show you how I created my query and what results it returns. I haven't changed the query since it works the way you expect. — Val, May 13 '20 at 05:59
Thanks. My bad. Your query does indeed answer my question. There's another issue that I'm putting in a new question because your answer is complete for this question. Maybe you can shed light on this perhaps more typical situation https://stackoverflow.com/questions/61768534/assign-a-higher-score-to-matches-containing-the-search-query-at-an-earlier-posit — GNG, May 13 '20 at 07:18
