
I'm looking for a better way.

I have an arbitrary number of input terms (let's say they are last names) from the user. I want to perform a prefix search on each one and boost score for any matches.

I'm currently using the not-analyzed prefix query -- see the example below. However, I'm also taking on the work an analyzer would otherwise do: custom code splits the input terms on whitespace, trims and lower-cases them, and then constructs a series of prefix queries from the tokens to boost scoring, like so.

  1. Example input is a bunch of last names like "Smith, Rodriguez, ROBERTS, doe".
  2. Then my code parses them into tokens and lower-cases them:
    smith
    rodriguez
    roberts
    doe

  3. Finally, I construct multiple prefix queries to boost the score, like so:

"should": [
  {
      "dis_max" : {
          "tie_breaker": 1,
          "boost": 3,
          "queries": [
              {
                  "prefix" : { "surname": "doe"}
              },
                {
                  "prefix" : { "surname": "rob"}
              },
                {
                  "prefix" : { "surname": "rod"}
              },
                {
                  "prefix" : { "surname": "smi"}
              }
          ]
      }
  }
],
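For reference, the client-side preprocessing in steps 1-3 amounts to something like the following sketch (hypothetical Python; the function name is mine, and the 3-character prefix length matches the queries above):

```python
def build_prefix_should_clause(raw_input, prefix_length=3):
    """Tokenize user input and build a dis_max of prefix queries.

    Sketch of the client-side work described above: split on commas
    and whitespace, trim, lower-case, then truncate each token to the
    prefix length before building one prefix query per token.
    """
    tokens = sorted({
        term.strip().lower()[:prefix_length]
        for term in raw_input.replace(",", " ").split()
        if term.strip()
    })
    return {
        "should": [{
            "dis_max": {
                "tie_breaker": 1,
                "boost": 3,
                "queries": [
                    {"prefix": {"surname": token}} for token in tokens
                ],
            }
        }]
    }

clause = build_prefix_should_clause("Smith, Rodriguez, ROBERTS, doe")
# tokens become ["doe", "rob", "rod", "smi"], as in the query above
```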

I can't help but think I'm doing this inefficiently and that Elasticsearch might provide smarter features I don't know about. I wish there were an analyzed form of the prefix query to make my life easier. For example, it would be ideal to pass the input verbatim to a query like "someAnalyzedPrefix": {"surname": "smith rodriguez roberts doe", "prefix_length": 3} and have it analyzed for me. I'm dreaming a bit here, but you get the gist: I'm looking for a more concise solution.

I wonder if any other kind of query can achieve the same outcome while taking responsibility for the analysis.

All suggestions for improvement are welcome; otherwise I'll stick with the pattern above, since it meets the need, if not beautifully.

John K

1 Answer


I think the Edge NGram tokenizer / filter will help.

You can define the index with separate index-time and search-time analyzers. The index analyzer just lower-cases and produces edge n-grams. The search analyzer adds a word delimiter filter, which takes care of parsing your query. Note that you could omit the word delimiter filter and use a standard tokenizer instead of whitespace; that would also handle splitting on whitespace and commas. The word delimiter filter simply gives you more control over how the tokens are split.

You can always use the _analyze api to test how your tokenization will work.
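To illustrate what the index analyzer below should emit, here is a rough Python simulation of its behavior (an assumption on my part -- whitespace tokenizer, lowercase filter, then an edge_ngram filter with min_gram and max_gram of 3; verify against your actual index with the _analyze API):

```python
def simulate_edge_ngram_analyzer(text, min_gram=3, max_gram=3):
    """Approximate the custom index analyzer below: whitespace
    tokenization, lower-casing, then edge n-grams of each token.

    Tokens shorter than min_gram produce no n-grams here, mirroring
    the edge_ngram filter's behavior with these settings.
    """
    ngrams = []
    for token in text.lower().split():
        for n in range(min_gram, max_gram + 1):
            if len(token) >= n:
                ngrams.append(token[:n])
    return ngrams
```

So a document with surname "Rodriguez" would be indexed under the term "rod", which is what lets the 3-character query tokens match.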

Index Settings:

{
    "settings" : {
        "analysis" : {
          "filter": {
            "word_delimiter_filter": {
                  "preserve_original": "true",
                  "catenate_words": "true",
                  "catenate_all": "true",
                  "split_on_case_change": "true",
                  "type": "word_delimiter",
                  "catenate_numbers": "true",
                  "stem_english_possessive": "false"
            },
            "edgengram_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 3
            }
        },
        "analyzer" : {
                "my_edge_ngram_analyzer" : {
                    "filter": [
                        "lowercase",
                        "edgengram_filter"
                    ],
                    "type": "custom",
                    "tokenizer" : "whitespace"
                },
                "my_edge_ngram_search_analyzer": {
                  "filter": [
                    "lowercase",
                    "word_delimiter_filter",
                    "edgengram_filter"
                  ],
                  "type": "custom",
                  "tokenizer": "whitespace"
                }
            }
        }
    }
}

Mapping:

{
      "properties": {
        "surname_edgengrams": {
            "type": "string",
            "analyzer": "my_edge_ngram_analyzer",
            "search_analyzer": "my_edge_ngram_search_analyzer"
        },
        "surname": {
          "type": "string",
          "index": "not_analyzed",
          "copy_to": [
              "surname_edgengrams"
            ]
        }
      }  
}

I indexed some documents using bulk api:

{ "index" : { "_index" : "edge_test", "_type" : "test_mapping", "_id" : "1" } }
{ "surname" : "Smith" }
{ "index" : { "_index" : "edge_test", "_type" : "test_mapping", "_id" : "2" } }
{ "surname" : "Rodriguez" }
{ "index" : { "_index" : "edge_test", "_type" : "test_mapping", "_id" : "3" } }
{ "surname" : "Roberts" }
{ "index" : { "_index" : "edge_test", "_type" : "test_mapping", "_id" : "4" } }
{ "surname" : "Doe" }

And ran the following search query:

{
    "query" : {
        "bool" : {
            "should" : [{
                    "match" : {
                        "surname_edgengrams" : {
                            "query" : "Smith, Rodriguez, ROBERTS, doe",
                            "boost" : 3
                        }
                    }
                }
            ]
        }
    }
}

Results:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.14085768,
    "hits": [
      {
        "_index": "edge_test",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.14085768,
        "_source": {
          "surname": "Smith"
        }
      },
      {
        "_index": "edge_test",
        "_type": "test_mapping",
        "_id": "3",
        "_score": 0.14085768,
        "_source": {
          "surname": "Roberts"
        }
      },
      {
        "_index": "edge_test",
        "_type": "test_mapping",
        "_id": "2",
        "_score": 0.13145615,
        "_source": {
          "surname": "Rodriguez"
        }
      },
      {
        "_index": "edge_test",
        "_type": "test_mapping",
        "_id": "4",
        "_score": 0.065728076,
        "_source": {
          "surname": "Doe"
        }
      }
    ]
  }
}
jay