
I'm experimenting with Elasticsearch (v1.7.3, with the Java Transport Client) to search a database of human names. I index my name fields with several of the available phonetic algorithms (DoubleMetaphone, RefinedSoundex, etc.) and store the results. However, the score I need is the percentage closeness of the input token to the one in the index.

For example:

The following document, when indexed using the phonetic algorithms:

{
  "FullName": "Christopher Cruickshank"
}

is expanded as follows (output taken from the Analyze API):

{
  "tokens": [
    {
      "token": "C3090360109",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "christopher",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "K3936",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "KRST",
      "start_offset": 0,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "C3903083",
      "start_offset": 12,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "cruickshank",
      "start_offset": 12,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "K3935",
      "start_offset": 12,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "KRKX",
      "start_offset": 12,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Now, at search time, when I query for:

{
  "match": {
    "FullName": {
      "query": "Cristopher Krukshank",
      "boost": 10.0
    }
  }
}

What I'd like to do is score the results based on the number of matched tokens from the index, i.e.:

(Number of matched tokens per term / Total number of expanded tokens per term) * Boost
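To make the formula concrete, here is a minimal sketch of the per-term calculation in plain Python. The indexed tokens are copied from the analyze output above; the query-side token set is an assumption made up for illustration (it is not real analyzer output), and `term_score` is a hypothetical helper name.

```python
from typing import Set


def term_score(index_tokens: Set[str], query_tokens: Set[str], boost: float = 1.0) -> float:
    """(matched tokens / total expanded tokens for the indexed term) * boost."""
    if not index_tokens:
        return 0.0
    matched = len(index_tokens & query_tokens)
    return matched / len(index_tokens) * boost


# Expanded tokens for "Christopher", taken from the analyze output above.
indexed_christopher = {"C3090360109", "christopher", "K3936", "KRST"}

# ASSUMPTION: hypothetical expansion of the query term "Cristopher".
# "X0000" is a placeholder, not the output of any real phonetic encoder.
query_cristopher = {"cristopher", "KRST", "K3936", "X0000"}

score = term_score(indexed_christopher, query_cristopher, boost=10.0)
print(score)  # 2 of the 4 indexed tokens match
```

With these assumed inputs, two of the four indexed tokens match, so the term scores half of the boost.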

Although this could work conceptually, I'm wondering whether there is a better way to achieve the same thing.

Also, I'm inclined to push most of the complexity and logic to index time (for example, by storing the total token count in a field) so that my search logic is simpler. If this is a reasonable approach, I would like to know whether there are any technical implications of using the Analyze API during indexing, especially when bulk indexing millions of names. I'm guessing the Analyze API would be called for every original token to obtain each of its expanded tokens (which can potentially be a huge number of calls!).
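One possible way to avoid an Analyze API call per name is to compute the expansion count on the client while building the bulk request. The sketch below assumes the client can apply encoders equivalent to the index analyzer; the two encoders shown are toy stand-ins (not DoubleMetaphone or RefinedSoundex), and the field name `FullNameTokenCount`, the index name, and the type name are hypothetical.

```python
import json
from typing import Callable, Iterable, List

Encoder = Callable[[str], str]


def expanded_token_count(name: str, encoders: List[Encoder]) -> int:
    """Count distinct tokens per word: the lowercased word plus one code per encoder."""
    total = 0
    for word in name.split():
        tokens = {word.lower()} | {encode(word) for encode in encoders}
        total += len(tokens)
    return total


def bulk_body(names: Iterable[str], index: str, doc_type: str, encoders: List[Encoder]) -> str:
    """Build a newline-delimited bulk body: one action line, then one source line per document."""
    lines = []
    for name in names:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps({
            "FullName": name,
            # Hypothetical field holding the total number of expanded tokens.
            "FullNameTokenCount": expanded_token_count(name, encoders),
        }))
    return "\n".join(lines) + "\n"


# ASSUMPTION: toy encoders standing in for the real phonetic algorithms.
toy_encoders = [lambda w: w[0].upper(), lambda w: str(len(w))]

body = bulk_body(["Christopher Cruickshank"], "names", "person", toy_encoders)
print(body)
```

The count is only meaningful if the client-side encoders produce the same tokens as the analyzer configured on the index, which would need to be verified separately.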

If this is not a reasonable approach at all, could someone offer some pointers or share their experience?

The other option I'm considering is to call the Analyze API at query time, send the query to Elasticsearch with the "explain" option, and then do a string match on the explain section to work out how many tokens matched.
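A rough sketch of that string-matching idea follows. It assumes each explanation is a nested structure with `description` and `details` keys, and that term-level nodes are described as `weight(<field>:<term> in <doc>) ...`; the exact wording depends on the Elasticsearch/Lucene version and should be checked against real output. The sample explanation is hand-written for illustration, not captured from a cluster, and `matched_terms` is a hypothetical helper.

```python
import re
from typing import Any, Dict, Set

# ASSUMPTION: format of a term-level description in the explain output.
TERM_PATTERN = re.compile(r"weight\((?P<field>[^:()]+):(?P<term>\S+) in \d+\)")


def matched_terms(explanation: Dict[str, Any], field: str) -> Set[str]:
    """Walk the explanation tree and collect the terms that matched on `field`."""
    found = set()
    stack = [explanation]
    while stack:
        node = stack.pop()
        match = TERM_PATTERN.search(node.get("description", ""))
        if match and match.group("field") == field:
            found.add(match.group("term"))
        stack.extend(node.get("details") or [])
    return found


# Hand-written sample shaped like an explanation; values are illustrative only.
sample_explanation = {
    "value": 1.0,
    "description": "sum of:",
    "details": [
        {"value": 0.5,
         "description": "weight(FullName:KRST in 0) [PerFieldSimilarity], result of:",
         "details": []},
        {"value": 0.5,
         "description": "weight(FullName:K3936 in 0) [PerFieldSimilarity], result of:",
         "details": []},
    ],
}

print(sorted(matched_terms(sample_explanation, "FullName")))
```

The size of the returned set can then be divided by the number of tokens the query term expanded to, as in the formula above.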

user1189332
  • Do you mean something similar to what I described in http://stackoverflow.com/questions/39100218/scoring-based-on-number-of-matching-terms ? – Antoni Myłka Aug 23 '16 at 14:01
  • Were you able to solve this? I don't want to add another network call by calling the Analyze API before searching – Aditya Pawade Jun 05 '18 at 19:26

1 Answer


We did this in an indirect way. I was looking for a better approach when I came across your post.

The idea is this: when searching for "Cristopher Krukshank", suppose the first hit is:

"Cristopher Krukshank Jr." with a score of 10.0

Then take that first result, "Cristopher Krukshank Jr.", and run it as the query. The top hit will of course again be "Cristopher Krukshank Jr.", but with a higher score, for example 20.0.

So you know the maximum score is 20, and for the partial match the final score is first score / max score, which is 10 / 20 = 0.5. The final score is a value between 0 and 1, where 1 means an exact match.
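The normalization step can be sketched as follows. The two scores are the example values from this answer; in practice they would come from two separate searches, and `normalized_score` is a hypothetical helper name.

```python
def normalized_score(hit_score: float, self_score: float) -> float:
    """Divide the hit's score for the original query by its score when queried with its own text."""
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return hit_score / self_score


# Example values from the text: 10.0 for the original query,
# 20.0 when the top hit is searched with its own name.
print(normalized_score(10.0, 20.0))
```

This costs a second search per hit that needs normalizing, which is the price of the indirect approach.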

One problem is that the input may contain a token that matches nothing. For example, in "Cristopher Krukshank XXXXX", XXXXX may not be a token in the index. To correct for this, we have to use the number of tokens to recalculate the score.

ouflak
Steven Wu