ElasticSearch - Query the data on a field that matches from first position

Question

I searched a lot on this and tried numerous combinations, but every attempt failed.

Here is my problem. I created a JDBC river in Elasticsearch as follows:

{
    "type" : "jdbc",
    "jdbc" : {
        "driver" : "oracle.jdbc.driver.OracleDriver",
        "url" : "jdbc:oracle:thin:@//ip:1521/db",
        "user" : "user",
        "password" : "pwd",
        "sql" : "select f1, f2, f3 from table"
    },
    "index" : {
        "index" : "subject2",
        "type" : "name2",
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analizer": {
                        "type": "custom",
                        "tokenizer": "my_pattern_tokenizer",
                        "filter": []
                    }
                },
                "tokenizer": {
                    "my_pattern_tokenizer": {
                        "type": "pattern",
                        "pattern": "$^"
                    }
                },
                "filter": []
            }
        }
    },
    "mappings": 
    {
        "subject2": 
        {
            "properties" : {
                "f1" : {"index" : "not_analyzed", "store": "yes", "analyzer": "my_analizer", "search_analyzer": "keyword", "type": "string"},
                "f2" : {"index" : "not_analyzed", "store": "yes", "analyzer": "my_analizer", "search_analyzer": "keyword", "type": "string"},
                "f3" : {"index" : "not_analyzed", "store": "yes", "analyzer": "my_analizer", "search_analyzer": "keyword", "type": "string"}
            }
        }
    }
}

I want to implement an auto-complete feature that matches the user-entered value against the data in the "f1" field (only that field for now), but only from the start of the value.

Data in the f1 field looks like this:

"Hardin County ABC"
"Country of XYZ"
"County of Blah blah"
"County of Blah second"

The requirement is that when the user types "Coun", Elasticsearch returns the 2nd, 3rd and 4th values but not the first. I read that the "keyword" analyzer turns the complete value into a single token, but I don't know why it is not working in this case.

Also, if the user types "County of B", then the 3rd and 4th values should be returned.
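
To pin down the expected behaviour independently of Elasticsearch, here is a minimal Python sketch of the "starts with" rule described above. The function name `starts_with_matches` is made up for illustration, and ignoring case is an assumption (the queries below are typed in lowercase).

```python
def starts_with_matches(rows, typed):
    """Return the rows whose full value begins with the typed text.

    Case is ignored here; that is an assumption, not something the
    requirement above states explicitly.
    """
    needle = typed.lower()
    return [row for row in rows if row.lower().startswith(needle)]


rows = [
    "Hardin County ABC",
    "Country of XYZ",
    "County of Blah blah",
    "County of Blah second",
]

coun_hits = starts_with_matches(rows, "Coun")                # rows 2, 3 and 4
county_of_b_hits = starts_with_matches(rows, "County of B")  # rows 3 and 4
```

Whatever mapping and query are chosen, they should reproduce these two result sets.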

Below are the queries I have tried.

Option 1

{"from":0,"size":10, "query":{ "field" : { "f1" : "count*" } } }

Option 2

{"from":0,"size":10, "query":{ "span_first" : {
        "match" : {
            "span_term" : { "COMPANY" : "hardin" }
        },
        "end" : 1
    } } }

Please tell me what I am doing wrong here. Thanks in advance.

Did either of the answers work for you, if so please accept one! If not can you give us details as to why not. Thanks — ramseykhalaf, Aug 18 '13 at 06:22

score 1 · Accepted Answer · answered Aug 15 '13 at 11:00

Before I answer, I want to point out that you are defining an analyzer and then setting index: not_analyzed, which means the analyzer is never used. (Using not_analyzed has the same effect as using the keyword analyzer: the whole string, untouched, becomes one token.)

Also, analyzer: my_analizer is a shortcut for index_analyzer: my_analizer plus search_analyzer: my_analizer, so setting it alongside a separate search_analyzer makes your mapping a bit confusing to me...

Also, the fields are kept in the _source unless you turn that off. You don't need to store the fields separately unless you disable _source and still need a field returned in the result set.
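
To make the tokenization point concrete, here is a rough Python sketch of what ends up in the index in each case. Both functions are illustrative stand-ins, not Elasticsearch code; in particular `roughly_standard_tokens` only approximates the default analyzer (it splits on non-alphanumeric characters and lowercases, and ignores details such as stop words).

```python
import re


def keyword_tokens(value):
    # not_analyzed (or the keyword analyzer): the whole string is one token,
    # with case and spaces left untouched.
    return [value]


def roughly_standard_tokens(value):
    # Rough approximation of an analyzed string field:
    # split into words and lowercase each one.
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", value)]


single = keyword_tokens("Hardin County ABC")
words = roughly_standard_tokens("Hardin County ABC")
```

With one untouched token, a lowercase query such as "count*" has nothing to match against "County of Blah blah"; with word tokens, "county" matches no matter where it sits in the value.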

There are 2 ways I can think of doing this:

1. Use a `match_phrase_prefix` query - Easier but slower

Don't define any analyzers, you don't need them.

Mapping:

"subject2": {
    "properties" : {
        "f1" : { "type": "string" },
        "f2" : { "type": "string" },
        "f3" : { "type": "string" }
    }
}

Query:

"match_phrase_prefix" : {
    "f1" : {
        "query" : "Count"
    }
}
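
One caveat with this option: on an analyzed field the query works on word tokens, so the phrase may begin at any word, not only the first one. The Python sketch below imitates that behaviour under simplifying assumptions (whitespace/punctuation tokenization plus lowercasing, no stop words, no limit on prefix expansions); `phrase_prefix_match` is a made-up helper, not an Elasticsearch API.

```python
import re


def tokens(text):
    # Simplified analysis: split into words and lowercase them.
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", text)]


def phrase_prefix_match(field_value, typed):
    """True if the typed words appear consecutively in the field, with the
    last typed word treated as a prefix. The phrase may start at any word."""
    doc = tokens(field_value)
    query = tokens(typed)
    if not query:
        return False
    *exact, last = query
    for start in range(len(doc) - len(query) + 1):
        window = doc[start:start + len(query)]
        if window[:-1] == exact and window[-1].startswith(last):
            return True
    return False


hardin_matches = phrase_prefix_match("Hardin County ABC", "Coun")
```

Here `hardin_matches` is true, because "county" is a token of "Hardin County ABC". If rows like that must be excluded, this option alone does not enforce "from the first position".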

2. Use an `edge_ngram` token filter - Harder but faster

"settings": {
    "analysis": {
        "analyzer": {
            "edge_autocomplete": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["my_edge_ngram"]
            }
        },
        "filter" : {
            "my_edge_ngram" : {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 15
            }
        }
    }
}

Mapping:

"subject2": {
    "properties" : {
        "f1" : { "type": "string", "index_analyzer": "edge_autocomplete" },
        "f2" : { "type": "string", "index_analyzer": "edge_autocomplete" },
        "f3" : { "type": "string", "index_analyzer": "edge_autocomplete" }
    }
}

Query:

"match" : {
    "f1" : {
        "query" : "Count",
        "analyzer": "keyword"
    }
}
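
To see why this matches only from the first position, here is a small Python simulation of the idea: the keyword tokenizer keeps the whole value as one token, and the edge n-gram filter then emits its leading substrings of 2 to 15 characters. The helper names `edge_ngrams` and `search` are made up for this sketch; they are not Elasticsearch APIs.

```python
def edge_ngrams(value, min_gram=2, max_gram=15):
    # Leading substrings only: "Co", "Cou", "Coun", ... up to max_gram chars.
    longest = min(max_gram, len(value))
    return [value[:size] for size in range(min_gram, longest + 1)]


rows = [
    "Hardin County ABC",
    "Country of XYZ",
    "County of Blah blah",
    "County of Blah second",
]

# Index time: every row is stored as the set of its edge n-grams.
indexed = {row: set(edge_ngrams(row)) for row in rows}


def search(typed):
    # Search time: the typed text is kept as one term (keyword analyzer)
    # and must equal one of the indexed grams exactly.
    return [row for row, grams in indexed.items() if typed in grams]


coun_hits = search("Coun")
county_of_b_hits = search("County of B")
```

Two things follow from the settings as written. There is no lowercase filter, so `search("coun")` finds nothing, and with `max_gram` at 15 a longer input such as "County of Blah s" (16 characters) also finds nothing.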

Good luck!

answered Aug 15 '13 at 11:00 by ramseykhalaf

You would need to combine 1 and 2 to get the desired outcome, right? – Scott Rice Aug 15 '13 at 13:54
No, that's why I said "There are 2 ways I can think of doing this:". The issue with the first is the `match_phrase_prefix` query will expand into a large boolean query. Have a look [at this question to see how lucene queries are rewritten](http://stackoverflow.com/questions/14059985/how-to-improve-a-single-character-prefixquery-performance) – ramseykhalaf Aug 15 '13 at 14:00
I needed to read [this](http://euphonious-intuition.com/2013/02/starts-with-phrase-matching-in-elasticsearch/) - I didn't understand that match_phrase_query would return word fragments without first tokenizing. – Scott Rice Aug 15 '13 at 14:08
Exactly, while that's great out of the box, it will be slower than tokenizing at index-time. Unless you do **far** more indexing than searching... – ramseykhalaf Aug 15 '13 at 14:12
It didn't work. For the first one, I had already tried the match prefix query, but a few results that do not start with the word still come out on top. My db actually has more than 4 million rows. For the second one you are suggesting grams, which again produce tokens out of my string :( Here is my query {"from":0,"size":10, "query":{ "match" : { "COMPANY" : {"query" : "country", "analyzer":"keyword", "boost": 2.2, "type" : "phrase_prefix" } } } } – shaILU Aug 16 '13 at 12:18
Can you please give me a little more info about why the first isn't working. Can you update your question with the query and results, **then** give your desired results. From what I read in your question, I am surprised the `match_phrase_prefix` isn't working. – ramseykhalaf Aug 16 '13 at 12:35
For the first option, my query is below: { "from" : 0, "query" : { "match" : { "f1" : { "boost" : 2, "query" : "Coun", "type" : "phrase_prefix" } } }, "size" : 10 } and the rows that start with "Coun" are not all ranked on top; a few other rows appear in between. The result is: "COUNTRY COMPANY" "HURON, COUNTY OF (572)" "CHILTON, COUNTY OF (101)" "BLACKMOUNT COUNTRY CLUB" "RIVER HILLS COUNTRY CLUB" "COUNTRY CLUB MARBLE & TILE" – shaILU Aug 19 '13 at 06:37
One more thing for reference: when I use this query {"from":0,"size":10, "query":{ "match" : { "f1" : {"query" : "MICROSOURCE", "type" : "phrase_prefix", "boost":2 } } } } it works for a single word (although sorting is still an issue), but as soon as I add a space and a few characters of the second word it stops returning any results. – shaILU Aug 19 '13 at 09:12

Scott Rice · Answer 2 · 2013-08-14T18:51:27.097

Have you tried an ngram filter? It will tokenize strings of character-length "n". So, your mapping could look like:

  {
    "settings": {
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "kstem", "ngram"]
                }
            },
            "filter" : {
                "ngram" : {
                   "type": "ngram",
                   "min_gram": 2,
                   "max_gram": 15
                }
            }
        }
    },
    "mappings": {
        "subject2": {
            "properties" : {
                 "f1" : {
                    "type": "multi_field",
                     "fields": {
                         "f1": {
                             "type": "string"
                         },
                         "autocomplete": {
                             "analyzer": "autocomplete", 
                             "type": "string"
                         },
...

This will return the ngram "count" for the 2nd, 3rd, and 4th results, which should give you the desired outcome.

Note that making "f1" a multi_field field is not required. However, when you don't need the "autocomplete" analyzer, such as when returning "f1" in the search results, then it is less expensive to use the "f1" subfield. If you do use a "multi_field", you can access "f1" at "f1" (without dot notation), but to access "autocomplete" you need to use dot notation - so "f1.autocomplete".
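
As a rough check of what this analyzer indexes, the Python sketch below imitates a standard-like tokenizer followed by an `ngram` filter with `min_gram` 2 and `max_gram` 15. It is a simplification: the `kstem` filter is left out, the tokenizer is approximated with a regular expression, and the helper names are invented for illustration.

```python
import re


def standard_like_tokens(text):
    # Approximation of tokenizer + lowercase filter: split into words, lowercase.
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", text)]


def ngrams(token, min_gram=2, max_gram=15):
    # Every substring of the token between min_gram and max_gram characters,
    # starting at ANY position (unlike edge n-grams).
    grams = set()
    longest = min(max_gram, len(token))
    for size in range(min_gram, longest + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def indexed_grams(field_value):
    grams = set()
    for token in standard_like_tokens(field_value):
        grams |= ngrams(token)
    return grams


hardin = indexed_grams("Hardin County ABC")
```

Because the grams are built per word and from every offset, `hardin` contains "coun" (and inner fragments such as "ount"), so "Hardin County ABC" is also a candidate match for "Coun" under this mapping.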

He says that he doesn't want a search for "Count" to return "Hardin County ABC" as a result. Wouldn't your proposed solution also return that? (Sorry if I missed something.) — ramseykhalaf, Aug 15 '13 at 10:41

score 0 · Answer 3 · answered Aug 19 '13 at 12:00

Although the solution we finally implemented is a mix of approaches, the answer by "ramseykhalaf" is the closest match. +1 to him.

What I did: as long as the user has not typed anything after a space, I fire a match-prefix query and show the closest matches.

{"from":0,"size":10, "query":{ "match" : { "f1" : {"query" : "MICROSOU", "type" : "phrase_prefix", "boost":2} } } }

As soon as the user types any character after a space, I switch to a query_string query with a trailing wildcard. Because the field holds multiple words, the match is again very close to what the user is looking for.

{"from":0,"size":10, "query":{ "query_string" : { "default_field":"f1","query" : "micro int*", "boost":2 } } }

This gave us the closest solution to the requirement. I would be happy to hear of a more optimized solution that covers the use cases mentioned above.

One more thing: the river I created is now plain vanilla, with the fields set to "not_analyzed" and the analyzer set to "keyword".
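
The switching logic can be sketched in Python roughly as follows. This only builds the two request bodies shown above; the function name `build_autocomplete_query` and its parameters are made up for illustration, and escaping of special `query_string` characters is not handled.

```python
def build_autocomplete_query(typed, field="f1", size=10, boost=2):
    """Pick the query type based on whether a second word has been started."""
    words = typed.split()
    if len(words) > 1:
        # Something was typed after a space: wildcard on the last word.
        query = {
            "query_string": {
                "default_field": field,
                "query": " ".join(words) + "*",
                "boost": boost,
            }
        }
    else:
        # Still on the first word: phrase-prefix match.
        query = {
            "match": {
                field: {
                    "query": typed.strip(),
                    "type": "phrase_prefix",
                    "boost": boost,
                }
            }
        }
    return {"from": 0, "size": size, "query": query}


first_word_body = build_autocomplete_query("MICROSOU")
second_word_body = build_autocomplete_query("micro int")
```

The returned dictionaries can be serialized with `json.dumps` and sent as the search request body.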
