I have created a small example to demonstrate the specific issue I'm having. Briefly, when I create a multi-field mapping using a field of type Text with the Keyword analyzer, a Regexp search query whose pattern contains punctuation returns no documents. I use a dash in the following example to demonstrate the problem.

I’m using Elasticsearch 7.10.2. The index I’m targeting is already populated with millions of documents. The field of type Text where I need to run some regular expressions uses the Standard (default) analyzer. I understand that, because the field gets tokenized by the Standard analyzer, the following request:

POST _analyze
{
  "analyzer" : "default",
  "text" : "The number is: 123-4576891-73.\n\n"
}

will yield six tokens: the words "the", "number", "is" and the number groups "123", "4576891", "73". Since the dashes are stripped out during tokenization, it's obvious that a regular expression that relies on punctuation, like this one that contains two literal dashes:

"(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?" 

will not return a result. Note, for those not familiar with this, regex shortcuts do not work for Lucene-based Elasticsearch requests (at least not yet). Here's a reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html. Also, the use of word boundaries that I show in my examples (.*[^a-z0-9_])? and ([^a-z0-9_].*)? are from this post: Word boundary in Lucene regex.

To see this for yourself with an example, create and populate an index like so:

PUT /index-01
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "text": { "type": "text" }
    }
  }
}
POST index-01/_doc/
{
  "text": "The number is: 123-4576891-73.\n\n"
}
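
Elasticsearch search is near-real-time, so if you run the queries below immediately after indexing, refresh the index first to make the new document searchable:

POST index-01/_refresh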

The following Regexp search query will return nothing because of the tokenization issue I described earlier:

POST index-01/_search
{
  "size": 1,
  "query": {
    "regexp": {
      "text": {
        "value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
        "flags": "ALL",
        "case_insensitive": true,
        "max_determinized_states": 100000
      }
    }
  },
  "_source": false,
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}
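
As a sanity check that tokenization, and not the Regexp query itself, is the culprit: a pattern that fits within a single token, for example the middle number group by itself, should match and return the document:

POST index-01/_search
{
  "size": 1,
  "query": {
    "regexp": {
      "text": {
        "value": "4576891"
      }
    }
  }
}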

Most posts suggest a quick fix: target the Keyword multi-field instead of the Text field. The Keyword multi-field gets created automatically by dynamic mapping, as this shows:

GET index-01/_mapping/field/text

response:

{
  "index-01" : {
    "mappings" : {
      "text" : {
        "full_name" : "text",
        "mapping" : {
          "text" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Targeting the keyword field, I do get results back from the following Regexp search query:

POST index-01/_search
{
  "size": 1,
  "query": {
    "regexp": {
      "text.keyword": {
        "value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
        "flags": "ALL",
        "case_insensitive": true,
        "max_determinized_states": 100000
      }
    }
  },
  "_source": false,
  "highlight": {
    "fields": {
      "text.keyword": {}
    }
  }
}

Here's the hit-highlighted part of the result:

...
         "highlight" : {
          "text.keyword" : [
            "<em>This is my number 123-4576891-73. Thanks\n\n</em>"
          ]
        }
...

Because some of the documents contain a large amount of text, I raised the size limit on the text.keyword field with the ignore_above parameter:

PUT /index-01/_mapping
{
  "properties": {
    "text": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 32766
        }
      }
    }
  }
}

However, this will still skip some documents, since the targeted index contains text values even larger than this upper bound for a field of type Keyword (32766 is the maximum byte length Lucene allows for a single term). Also, according to the Elasticsearch documentation here: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html, this field type is really designed for structured data, constant values, and wildcard queries.
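
To gauge how many documents get skipped this way, a must_not/exists query can count the documents that have a text value but no indexed text.keyword value (a sketch against the mapping above; values longer than ignore_above are not indexed, so exists does not match them):

GET index-01/_count
{
  "query": {
    "bool": {
      "filter": { "exists": { "field": "text" } },
      "must_not": { "exists": { "field": "text.keyword" } }
    }
  }
}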

Following the documentation's guidance, I assigned the Keyword analyzer to a new sub-field of type Text (text.raw) by making this update to the mapping:

PUT /index-01/_mapping
{
  "properties": {
    "text": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 32766
        },
        "raw": {
          "type": "text",
          "analyzer": "keyword",
          "index": true
        }
      }
    }
  }
}

Now, you can see the additional mapping text.raw with this request:

GET index-01/_mapping/field/text

response:

{
  "index-01" : {
    "mappings" : {
      "text" : {
        "full_name" : "text",
        "mapping" : {
          "text" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 32766
              },
              "raw" : {
                "type" : "text",
                "analyzer" : "keyword"
              }
            }
          }
        }
      }
    }
  }
}

Next, I verified that the data was, in fact, mapped to the multi-fields:

POST index-01/_search
{
  "query": 
  {
    "match_all": {}
  },
  "fields": ["text", "text.keyword", "text.raw"]
}

response:

...
    "hits" : [
      {
        "_index" : "index-01",
        "_type" : "_doc",
        "_id" : "2R-OgncBn-TNB4PjXYAh",
        "_score" : 1.0,
        "_source" : {
          "text" : "The number is: 123-4576891-73.\n\n"
        },
        "fields" : {
          "text" : [
            "The number is: 123-4576891-73.\n\n"
          ],
          "text.keyword" : [
            "The number is: 123-4576891-73.\n\n"
          ],
          "text.raw" : [
            "The number is: 123-4576891-73.\n\n"
          ]
        }
      }
    ]
...

I also verified that the Keyword analyzer applied to the text.raw field produces a single token, as shown by the following request:

POST _analyze
{
  "analyzer" : "keyword",
  "text" : "The number is: 123-4576891-73.\n\n"
}

response:

{
  "tokens" : [
    {
      "token" : "The number is: 123-4576891-73.\n\n",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
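
Alternatively, you can let the index resolve the analyzer from the mapping by passing the field name to _analyze; this should produce the same single token:

POST index-01/_analyze
{
  "field" : "text.raw",
  "text" : "The number is: 123-4576891-73.\n\n"
}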

However, the same Regexp search query (wrapped in a bool query here) targeting the text.raw field returns nothing:

POST index-01/_search
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "text.raw":   {
              "value": "(.*[^a-z0-9_])?[0-9]{3}-[0-9]{7}-[0-9]{2}([^a-z0-9_].*)?",
              "flags": "ALL",
              "case_insensitive": true,
              "max_determinized_states": 100000
            }
          }
        }
      ]
    }
  },
  "_source": false,
  "highlight" : {
    "fields" : {
        "text.raw": {}
    }
  }
}

Please let me know if you know why I'm not getting back a result using the field type Text with the Keyword analyzer.
