I am searching an Elasticsearch index containing human names and addresses. The relevance ranking is good, but not as good as it needs to be, and queries are also too slow.

Our index uses a combination of ngram and edge_ngram analyzers. Our queries are bool queries that combine query_string and multi_match queries.

The ngrams allow us to search for misspelled names more quickly than using a fuzzy search.
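To illustrate what we mean, here is a rough Python sketch (not Elasticsearch code) of why the ngram fields tolerate misspellings. The `char_ngrams` helper is hypothetical: it only approximates our `ngram_analyzer` (lowercase, 2 to 5 character grams, ASCII letters and digits plus `'` and `-`).

```python
import re

def char_ngrams(text, min_gram=2, max_gram=5):
    """Approximate the ngram_analyzer from the settings below (assumption:
    ASCII input only). Lowercase, split on anything that is not a letter,
    digit, ' or -, then emit every substring of length min_gram..max_gram."""
    grams = set()
    for word in re.findall(r"[a-z0-9'-]+", text.lower()):
        for size in range(min_gram, max_gram + 1):
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams

query_grams = char_ngrams("Smit")     # misspelled search term
indexed_grams = char_ngrams("Smith")  # name as stored in the index

# Every gram of the misspelled term is also a gram of the indexed name,
# so the ngram field matches even though the whole words differ.
print(sorted(query_grams & indexed_grams))
print(query_grams <= indexed_grams)
```

An exact term comparison of "smit" against "smith" would find nothing, which is why we otherwise would need a fuzzy query.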

The edge ngrams allow us to assign a higher score to terms that appear in the same order in a document. I think this only works when the terms are spelled exactly as they appear in the index.
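A minimal Python sketch of that behaviour, under the assumption that the input contains only letters, digits, spaces, `'` and `-`. Because our `edge_ngram_tokenizer` lists `whitespace` in `token_chars`, the whole string is treated as a single token and its prefixes become the indexed terms. The `edge_ngrams` helper is hypothetical and only approximates the analyzer.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Approximate the edge_ngram_analyzer from the settings below: the
    lowercased string is one token (whitespace is a token character), and
    the emitted terms are its prefixes of length min_gram..max_gram."""
    token = text.lower()
    longest = min(len(token), max_gram)
    return {token[:size] for size in range(min_gram, longest + 1)}

indexed = edge_ngrams("Chris Clark")
exact = edge_ngrams("chris cl")   # correctly spelled, same word order
typo = edge_ngrams("chirs cl")    # transposed letters in the first word

# All prefixes of the correctly spelled query are prefixes of the name.
print(len(exact & indexed), "of", len(exact))   # 8 of 8

# After the typo only the prefixes before the first wrong letter match.
print(sorted(typo & indexed))                   # ['c', 'ch']
```

This is what I mean by the edge ngrams only helping when the terms are spelled exactly as indexed: one early typo removes almost all of the shared prefixes.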

We have overridden the default similarity module to effectively turn off TF-IDF scoring, since that is mostly irrelevant when searching names, i.e. proper nouns.

How can we further improve these index settings and query structure to improve relevance ranking?

In particular, one issue we have with this setup is that Elasticsearch undesirably boosts the score of documents in which a search term appears multiple times. For example, a search for "Sally Smit" assigns a higher score to Sandy Smith who lives at 10 Smith Rd than to Sally Smith who lives at 10 Plumb Rd. Of course, when a user searches for "Sally Smit", they want to see people named "Sally Smith" at the top of the results list.

Another important angle here is that word order matters. For example, searching for Allan Joseph should assign a higher score to Allan M Joseph than to Joseph Allan.
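Here is a rough Python sketch of why word order is hard to capture with our current fields. Both helpers are hypothetical approximations of our two analyzers (ASCII input assumed), not Elasticsearch code. The plain ngram terms are identical for both word orders, so only the edge ngrams can tell the two names apart.

```python
import re

def char_ngrams(text, min_gram=2, max_gram=5):
    """Approximate ngram_analyzer: per-word substrings of length 2..5."""
    grams = set()
    for word in re.findall(r"[a-z0-9'-]+", text.lower()):
        for size in range(min_gram, max_gram + 1):
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams

def edge_ngrams(text, min_gram=1, max_gram=20):
    """Approximate edge_ngram_analyzer: prefixes of the whole string."""
    token = text.lower()
    longest = min(len(token), max_gram)
    return {token[:size] for size in range(min_gram, longest + 1)}

query = "Allan Joseph"
for name in ("Allan M Joseph", "Joseph Allan"):
    ngram_hits = len(char_ngrams(query) & char_ngrams(name))
    edge_hits = len(edge_ngrams(query) & edge_ngrams(name))
    print(name, "| shared ngrams:", ngram_hits, "| shared prefixes:", edge_hits)

# Allan M Joseph | shared ngrams: 24 | shared prefixes: 6
# Joseph Allan | shared ngrams: 24 | shared prefixes: 0
```

In this simplified model the middle initial also cuts the shared prefixes short at "allan ", so even the desired match gets only partial credit from the edge ngram field.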

In general, it is difficult to find best practices for searching human names in an Elasticsearch index. I have searched Stack Overflow and the Elasticsearch forums without finding much. It would be helpful if you could show how to fix these index settings, mappings and/or queries to improve relevance ranking for human names and addresses, or point me toward better information and examples.

I'm pasting our index settings, an example query and result set below...

Our index settings and mappings

{
    "mappings": {
        "properties": {
            "firstname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    },
                    "ngram_analyzer": {
                        "type": "text",
                        "analyzer": "ngram_analyzer",
                        "index_options": "docs"
                    }
                }
            },
            "fullname": {
                "type": "text",
                "fields": {
                    "edge_ngram_analyzer": {
                        "type": "text",
                        "analyzer": "edge_ngram_analyzer"
                    },
                    "keyword": {
                        "type": "keyword"
                    },
                    "ngram_analyzer": {
                        "type": "text",
                        "analyzer": "ngram_analyzer",
                        "index_options": "docs"
                    }
                }
            },
            "home_address1": {
                "type": "text",
                "fields": {
                    "edge_ngram_analyzer": {
                        "type": "text",
                        "analyzer": "edge_ngram_analyzer",
                        "index_options": "docs"
                    },
                    "keyword": {
                        "type": "keyword"
                    },
                    "ngram_analyzer": {
                        "type": "text",
                        "analyzer": "ngram_analyzer",
                        "index_options": "docs"
                    }
                }
            },
            "home_city": {
                "type": "text",
                "fields": {
                    "edge_ngram_analyzer": {
                        "type": "text",
                        "analyzer": "edge_ngram_analyzer",
                        "index_options": "docs"
                    },
                    "keyword": {
                        "type": "keyword"
                    },
                    "ngram_analyzer": {
                        "type": "text",
                        "analyzer": "ngram_analyzer",
                        "index_options": "docs"
                    }
                }
            },
            "home_state": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    },
                    "ngram_analyzer": {
                        "type": "text",
                        "analyzer": "ngram_analyzer",
                        "index_options": "docs"
                    }
                }
            },
            "home_zip": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    },
                    "ngram_analyzer": {
                        "type": "text",
                        "analyzer": "ngram_analyzer"
                    }
                }
            },
            "lastname": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    },
                    "ngram_analyzer": {
                        "type": "text",
                        "analyzer": "ngram_analyzer",
                        "index_options": "docs"
                    }
                }
            },
            "middlename": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            }
        }
    },
    "settings": {
        "index": {
            "max_ngram_diff": "20",
            "number_of_shards": "1",
            "similarity": {
                "default": {
                    "type": "scripted",
                    "script": {
                        "source": "return doc.freq > 0 ? 1 : 0;"
                    }
                }
            },
            "analysis": {
                "analyzer": {
                    "edge_ngram_analyzer": {
                        "filter": [
                            "lowercase"
                        ],
                        "type": "custom",
                        "tokenizer": "edge_ngram_tokenizer"
                    },
                    "ngram_analyzer": {
                        "filter": [
                            "lowercase"
                        ],
                        "type": "custom",
                        "tokenizer": "ngram_tokenizer"
                    }
                },
                "tokenizer": {
                    "edge_ngram_tokenizer": {
                        "token_chars": [
                            "letter",
                            "digit",
                            "custom",
                            "whitespace"
                        ],
                        "custom_token_chars": "'-",
                        "min_gram": "1",
                        "type": "edge_ngram",
                        "max_gram": "20"
                    },
                    "ngram_tokenizer": {
                        "token_chars": [
                            "letter",
                            "digit",
                            "custom"
                        ],
                        "custom_token_chars": "'-",
                        "min_gram": "2",
                        "type": "ngram",
                        "max_gram": "5"
                    }
                }
            },
            "number_of_replicas": "1"
        }
    }
}

A sample query...

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "chris cl",
                        "fields": [
                            "fullname.ngram_analyzer^1.0",
                            "fullname.edge_ngram_analyzer^1.0",
                            "home_address1.edge_ngram_analyzer^1.0",
                            "home_address1.ngram_analyzer^1.0"
                        ],
                        "type": "most_fields",
                        "default_operator": "or",
                        "boost": 1
                    }
                }
            ],
            "should": [
                {
                    "multi_match": {
                        "query": "chris cl",
                        "fields": [
                            "fullname^1.0",
                            "home_address1^1.0"
                        ],
                        "type": "most_fields",
                        "operator": "OR",
                        "boost": 1
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "_source": {
        "includes": [
            "fullname",
            "firstname",
            "lastname",
            "home_address1",
            "home_city"
        ]
    }
}

Here are the first two results from this query...

    {
        "_id": "1",
        "_score": 28.0,
        "_source": {
            "home_address1": "613 S Chris Ln",
            "firstname": "Chris",
            "home_city": "MOUNT PROSPECT",
            "fullname": "Chris Huang",
            "lastname": "Huang"
        }
    },
    {
        "_id": "2",
        "_score": 22.0,
        "_source": {
            "home_address1": "719 W Truitt Ave",
            "firstname": "Chris",
            "home_city": "Chillicothe",
            "fullname": "Chris Clark",
            "lastname": "Clark"
        }
    }
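To make the problem concrete, here is a simplified Python model of how I understand the scoring of the two results above. It is a sketch under stated assumptions, not Elasticsearch's actual implementation: every distinct query term that matches a field contributes exactly 1 (our scripted similarity), the contributions of all fields are summed (`most_fields`), and the three helper functions are hypothetical approximations of the ngram, edge ngram and standard analyzers for ASCII input.

```python
import re

def char_ngrams(text, min_gram=2, max_gram=5):
    """Approximate ngram_analyzer: per-word substrings of length 2..5."""
    grams = set()
    for word in re.findall(r"[a-z0-9'-]+", text.lower()):
        for size in range(min_gram, max_gram + 1):
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams

def edge_ngrams(text, min_gram=1, max_gram=20):
    """Approximate edge_ngram_analyzer: prefixes of the whole string."""
    token = text.lower()
    longest = min(len(token), max_gram)
    return {token[:size] for size in range(min_gram, longest + 1)}

def words(text):
    """Rough stand-in for the standard analyzer on the plain text fields."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def model_score(query, doc):
    """Sum one point per matching term over every field the query targets."""
    score = 0
    for field in ("fullname", "home_address1"):
        for analyze in (char_ngrams, edge_ngrams, words):
            score += len(analyze(query) & analyze(doc[field]))
    return score

huang = {"fullname": "Chris Huang", "home_address1": "613 S Chris Ln"}
clark = {"fullname": "Chris Clark", "home_address1": "719 W Truitt Ave"}

for doc in (huang, clark):
    print(doc["fullname"], model_score("chris cl", doc))
```

The model does not claim to reproduce the exact `_score` values, but it ranks the documents in the same order as the real results: the ngrams of "Chris" in the street name add more points than the extra name matches for "Chris Clark" do.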