Using Elasticsearch 2.2, as a simple experiment, I want to remove the last character from any word that ends with the lowercase character "s". For example, the word "sounds" would be indexed as "sound".

I'm defining my analyzer like this:

{
  "template": "document-index-template",
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "sFilter": {
          "type": "pattern_replace",
          "pattern": "([a-zA-Z]+)([s]( |$))",
          "replacement": "$2"
        }
      },
      "analyzer": {
        "tight": {
          "type": "standard",
          "filter": [
            "sFilter",
            "lowercase"
          ]
        }
      }
    }
  }
}

Then when I analyze the term "sounds of silences" using this request:

<index>/_analyze?analyzer=tight&text=sounds%20of%20silences

I get:

{
   "tokens": [
      {
         "token": "sounds",
         "start_offset": 0,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "of",
         "start_offset": 7,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "silences",
         "start_offset": 10,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

I am expecting "sounds" to become "sound" and "silences" to become "silence".

Redtopia
  • Are you doing this for academic purposes or for practical language analysis? If you're trying to get better English language tokenising, there's an [analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html) for that. – Ant P Jun 16 '16 at 18:49

1 Answer

The analyzer setting above is invalid: an analyzer of type standard does not take a filter list, so sFilter is never applied. I think what you intended to use is an analyzer of type custom with its tokenizer set to standard.

Example:

{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "sFilter": {
          "type": "pattern_replace",
          "pattern": "([a-zA-Z]+)s$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "tight": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "sFilter"
          ]
        }
      }
    }
  }
}
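
To see what the two patterns do to individual tokens, here is a rough sketch in Python. It is only an approximation: Elasticsearch uses Java regular expressions, and Python's `re` module stands in for them here (the two should agree for these simple patterns). `apply_filter` is a hypothetical helper, not an Elasticsearch API, and `\1` / `\2` are Python's spelling of `$1` / `$2`. The token list is what I assume the standard tokenizer emits for the sample text, plus "glass" as an extra case.

```python
import re

# Assumed output of the standard tokenizer for "sounds of silences",
# plus "glass" to show a word that merely ends in "s".
tokens = ["sounds", "of", "silences", "glass"]


def apply_filter(pattern, replacement, token):
    """Hypothetical stand-in for pattern_replace: the filter sees one token
    at a time and rewrites every match inside it."""
    return re.sub(pattern, replacement, token)


# Pattern and replacement from the answer: keep group 1, drop the final "s".
answer = [apply_filter(r"([a-zA-Z]+)s$", r"\1", t) for t in tokens]

# Pattern and replacement from the question: group 2 is the trailing "s"
# itself, so the whole word collapses to "s". The " " alternative can never
# match, because tokens contain no spaces after tokenization.
question = [apply_filter(r"([a-zA-Z]+)([s]( |$))", r"\2", t) for t in tokens]

print(answer)    # ['sound', 'of', 'silence', 'glas']
print(question)  # ['s', 'of', 's', 's']
```

Note that the filter strips the final "s" from every token that ends in one, so "glass" becomes "glas". That matches the experiment described in the question, but it is not real stemming.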
keety
  • That response is incorrect. It will remove every "s" that follows a letter, whereas the question explicitly requires the "s" to be at the end of the word. You have to add a `$` at the end of the regexp: `"pattern": "([a-zA-Z]+)s$"` – mmoreram Aug 16 '22 at 10:55
  • @mmoreram you are correct; I've updated the answer. – keety Sep 13 '22 at 14:59