
We are migrating our search from the database to Elasticsearch. As part of this migration, we need to preserve the existing ability to partially search a field (including whitespace), similar to the SQL query below:

SELECT *
  FROM customer
 WHERE customer_id LIKE '%0995%';

Having said that, I've gone through multiple articles on Elasticsearch and on achieving this functionality. After that exercise, the following is what I've come up with:

  1. The majority of the articles I read recommended using an nGram analyzer/filter, so this is what my mapping & settings look like:

Note: The customer_id column is defined as VARCHAR2(100), hence the max length of 100.

{
   "customer-index":{
      "aliases":{
      },
      "mappings":{
         "customer":{
            "properties":{
               "customerName":{
                  "type":"text",
                  "fields":{
                     "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                     }
                  }
               },
               "customerId":{
                  "type":"text",
                  "fields":{
                     "keyword":{
                        "type":"keyword",
                        "ignore_above":256
                     }
                  },
                  "analyzer":"substring_analyzer"
               }
            }
         }
      },
      "settings":{
         "index":{
            "number_of_shards":"3",
            "provided_name":"customer-index",
            "creation_date":"1573333835055",
            "analysis":{
               "filter":{
                  "substring":{
                     "type":"ngram",
                     "min_gram":"3",
                     "max_gram":"100"
                  }
               },
               "analyzer":{
                  "substring_analyzer":{
                     "filter":[
                        "lowercase",
                        "substring"
                     ],
                     "type":"custom",
                     "tokenizer":"standard"
                  }
               }
            },
            "number_of_replicas":"1",
            "uuid":"XXXXXXXXXXXXXXXXX",
            "version":{
               "created":"5061699"
            }
         }
      }
   }
}

The search request looks like this:

{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "name.keyword": {
        "missing": "_first",
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": [
              "customer_id"
            ],
            "analyzer": "substring_analyzer"
          }
        }
      ]
    }
  }
}

With that being said, here are a couple of questions/issues:

  1. Let's say there are 3 records with customer_id:

0009950011214, 0009900011214, 0009920011214

When I search for "0995", ideally I expect to get only customer_id 0009950011214.

But I get all three records in the result set, and I believe it's due to the nGram analyzer and the way it splits the string (note: min_gram: 3 and max_gram: 100). Setting max_gram to 100 was intended to allow exact matches.
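To illustrate what I think is happening, here is a rough Python sketch of the nGram filter's behavior (my own simulation, not Elasticsearch code). The query string "0995" is also run through substring_analyzer, so it becomes the tokens "099", "995" and "0995", and query_string matches a document if any one token matches; "099" appears in all three IDs:

```python
def ngrams(value, min_gram=3, max_gram=100):
    """Rough simulation of the 'substring' ngram filter from the mapping."""
    tokens = set()
    for n in range(min_gram, min(max_gram, len(value)) + 1):
        for i in range(len(value) - n + 1):
            tokens.add(value[i:i + n])
    return tokens

ids = ["0009950011214", "0009900011214", "0009920011214"]

# The search term is analyzed too: {"099", "995", "0995"}
query_tokens = ngrams("0995")

for cid in ids:
    # Any single shared token is enough for query_string to match.
    print(cid, "matches via", sorted(query_tokens & ngrams(cid)))
```

Only the first ID shares all three tokens; the other two still match through the shorter gram "099".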

How should I fix this?

  2. This brings me to my second point. Is using an nGram analyzer the most effective strategy for this kind of requirement? My concern is the memory/index-size cost of having min_gram = 3 and max_gram = 100. Is there a better way to implement the same?

P.S: I'm on NEST 5.5.

user2697452

2 Answers


From the mapping I can see that the customerId field is a text field with a keyword sub-field (see Difference between keyword and text in Elasticsearch). So you can use a regexp query as shown below to run searches like the SQL query you have given as an example. Try this:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "regexp": {
                "customerId": {
                  "value": ".*0995.*",
                  "flags": "ALL"
                }
              }
            }
          ]
        }
      }
    }
  }
}

Notice the "." in the value of the regex expression: ".*<term>.*" behaves like a contains search, and "~(<term>)" behaves like not-contains. You can also append ".*" at only the start or the end of the search term to do ends-with and starts-with type searches. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-regexp-query.html
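If you prefer to avoid regex syntax, a wildcard query on the keyword sub-field is a similar option (a sketch assuming the customerId.keyword sub-field from the question's mapping; like regexp, a leading "*" can be expensive on large indices):

```json
{
  "query": {
    "wildcard": {
      "customerId.keyword": {
        "value": "*0995*"
      }
    }
  }
}
```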

soumitra goswami
  • Thank you Soumitra for your response. Just one question as to your thoughts on performance impact of using regex filter when compared to nGram. – user2697452 Nov 11 '19 at 08:32
  • Using tokenisers will make the index size bigger, as you are breaking the values into tokens and storing them, but no such thing needs to happen when you use regexp filter (ref - https://stackoverflow.com/questions/17927728/will-using-ngram-tokenizer-and-ngram-filters-increase-the-index-size-in-elastics) but, apart from this, I am not sure if a search query will perform faster when using tokens or when using a regexp. – soumitra goswami Nov 11 '19 at 09:47

On your customerId field you can set "search_analyzer": "standard" in the mapping. Then remove the line "analyzer": "substring_analyzer" from your search query.

This will ensure that the searched customerID is not tokenized into nGrams and is searched as is, while the customerIDs are indexed as nGrams. I believe that's the functionality that you were trying to replicate from your SQL query.
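For reference, this is roughly how the customerId mapping would look with a separate search_analyzer (a sketch based on the mapping in the question):

```json
"customerId": {
  "type": "text",
  "analyzer": "substring_analyzer",
  "search_analyzer": "standard",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
```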

Archit Saxena
  • Thank you Archit for your response. But after making the recommended changes, no results are getting returned; for the partial search. – user2697452 Nov 11 '19 at 08:31
  • can you try doing a simple match instead of the filter:query_string in your query. Also, in the query that you have posted, the customerID is spelt differently from the field in the mapping. – Archit Saxena Nov 11 '19 at 08:52