How to exclude asterisks while searching with analyzer

Question

I need to search by an array of values, and each value can be either simple text or text with asterisks (*). For example:

["MYULTRATEXT"]

And I have the following index (my real index is very large, so I have simplified it):

................
{
    "settings": {
         "analysis": {
            "char_filter": {
              "asterisk_remove": {
                "type": "pattern_replace",
                "pattern": "(\\d+)*(?=\\d)",
                "replacement": "1$"
              }
            },
            "analyzer": {
              "custom_search_analyzer": {
                "char_filter": [
                  "asterisk_remove"
                ],
                "type": "custom",
                "tokenizer": "keyword"
              }
            }
        }
    },
        "mappings": {
        "_doc": {
            "properties": {
               "name": {
                  "type": "text",
                  "analyzer":"keyword",
                  "search_analyzer": "custom_search_analyzer"
               },
     ......................

All data in the index is stored with asterisks (*), e.g.:

curl -X PUT "localhost:9200/locations/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
   "name" : "MY*ULTRA*TEXT"
}'

I need to get back exactly the same name value when I search by the string MYULTRATEXT:

curl -XPOST 'localhost:9200/locations/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "terms": { "name": ["MYULTRATEXT"] }  }
}'

It should return MY*ULTRA*TEXT, but it does not work, and I can't find a workaround. Any thoughts?

I tried pattern_replace, but it seems I am doing something wrong or missing something.

So I need to replace every * with an empty string while searching.

Assael Azran · Answer 1 · 2019-11-12T11:20:54.940

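This might help you. Your regex pattern is the issue.

You want to replace all * occurrences with an empty string; the pattern below does that. Note that this mapping uses my_analyzer at both index time and search time, so the stored value and the query text are normalized the same way (the lowercase filter also makes matching case-insensitive).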

PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer", 
          "search_analyzer":"my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "asterisk_remove": {
          "type": "pattern_replace",
          "pattern": "(?<=\\w)(\\*)(?=\\w)",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "asterisk_remove"
          ],
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  }
}

Analyze query

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["MY*ULTRA*TEXT"]
}

Results of analyze query

{
"tokens": [
    {
      "token": "myultratext",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    }
  ]
}
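To make the token above easier to reason about, here is a rough Python sketch of what my_analyzer does to a string. It is only an approximation under stated assumptions: Elasticsearch runs Java regular expressions inside Lucene, while this uses Python's re module, and the function name my_analyzer is borrowed from the mapping purely for illustration.

```python
import re

# Same pattern as the "asterisk_remove" token filter: an asterisk that has
# a word character directly before it and directly after it.
INNER_ASTERISK = re.compile(r"(?<=\w)(\*)(?=\w)")


def my_analyzer(text: str) -> str:
    """Approximate the custom analyzer: keyword tokenizer + two filters."""
    token = text                           # keyword tokenizer: whole input is one token
    token = token.lower()                  # "lowercase" filter
    token = INNER_ASTERISK.sub("", token)  # "asterisk_remove" filter
    return token


indexed_token = my_analyzer("MY*ULTRA*TEXT")  # what gets indexed
query_token = my_analyzer("MYULTRATEXT")      # what a match query looks up

print(indexed_token)                 # myultratext
print(indexed_token == query_token)  # True
```

Because the document and the query text end up as the same token, a match query can find the document, while the original string with asterisks is still what comes back in _source.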

Post a document

POST my_index/doc/1
{
  "name" : "MY*ULTRA*TEXT"
}

Search query

GET my_index/_search
{
  "query": {
    "match": {
      "name": "MYULTRATEXT"
    }
  }
}

Or

GET my_index/_search
{
  "query": {
    "match": {
      "name": "myultratext"
    }
  }
}

Results of search query

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "name": "MY*ULTRA*TEXT"
        }
      }
    ]
  }
}

Hope it helps

eemp · Accepted Answer · 2019-11-12T04:26:47.083

There appears to be a problem with the regex you provided and the replacement pattern.

I think what you want is:

            "char_filter": {
              "asterisk_remove": {
                "type": "pattern_replace",
                "pattern": "(\\w+)\\*(?=\\w)",
                "replacement": "$1"
              }
            }

Note the following changes:

\d => \w (match word characters instead of only digits)
escape * since asterisks have a special meaning for regexes
1$ => $1 ($<GROUPNUM> is how you reference captured groups)
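These changes can be checked outside Elasticsearch. The following is a minimal sketch using Python's re module, which is an assumption made for illustration: pattern_replace uses Java regular expressions, where the group reference is written $1, whereas Python writes it \1. The function names are hypothetical.

```python
import re

# Pattern from the question, kept only for comparison.
ORIGINAL_PATTERN = r"(\d+)*(?=\d)"
# Pattern from this answer: word characters, a literal asterisk,
# then a lookahead requiring another word character.
FIXED_PATTERN = r"(\w+)\*(?=\w)"


def original_pattern_matches(text: str) -> bool:
    """Return True if the question's pattern matches anywhere in text."""
    return re.search(ORIGINAL_PATTERN, text) is not None


def strip_inner_asterisks(text: str) -> str:
    """Apply the fixed pattern, keeping the captured word characters."""
    return re.sub(FIXED_PATTERN, r"\1", text)  # \1 here plays the role of $1 in Java


sample = "MY*ULTRA*TEXT"
print(original_pattern_matches(sample))  # False
print(strip_inner_asterisks(sample))     # MYULTRATEXT
```

The original pattern only looks at digits, so it never touches the sample. Note also that the fixed pattern requires a word character on both sides of the asterisk, so a leading or trailing asterisk (as in *ABC*) is left in place.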

To see how Elasticsearch analyzes text with a given analyzer, or to check that you defined an analyzer correctly, you can use the _analyze API endpoint: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

If you try this API with your current definition of custom_search_analyzer, you will find that "MY*ULTRA*TEXT" is analyzed to "MY*ULTRA*TEXT" and not "MYULTRATEXT" as you intend.
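One way to run that kind of check without creating an index is to pass the char filter inline in the _analyze request body. The sketch below only builds that body in Python; the helper name build_analyze_body is hypothetical, and actually sending it (for example as a POST to localhost:9200/_analyze) is left out because it needs a running cluster.

```python
import json


def build_analyze_body(text: str, pattern: str, replacement: str) -> dict:
    """Build an _analyze request body with an inline pattern_replace char filter."""
    return {
        "tokenizer": "keyword",
        "char_filter": [
            {
                "type": "pattern_replace",
                "pattern": pattern,
                "replacement": replacement,
            }
        ],
        "text": text,
    }


# The corrected pattern and replacement from this answer.
body = build_analyze_body("MY*ULTRA*TEXT", r"(\w+)\*(?=\w)", "$1")

# json.dumps doubles the backslashes, giving the same escaped form
# that appears in the index settings.
print(json.dumps(body, indent=2))
```

Swapping in the original pattern and replacement lets you compare the two definitions side by side before touching the index settings.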

I have a personal app that I use to more easily interact with and visualize the results of the ANALYZE API. I tried your example and you can find it here: Elasticsearch Analysis Inspector.
