
I want to configure Elasticsearch so that searching for "JaFNam" produces a good score for "JavaFileName".

I tried to build an analyzer that combines a CamelCase pattern analyzer with an edge_ngram tokenizer. I thought this would create terms like these:

J F N Ja Fi Na Jav Fil Nam Java File Name

But the tokenizer doesn't seem to have any effect; I keep getting these terms:

Java File Name

What would the correct Elasticsearch configuration look like?


Example code:

curl -XPUT    'http://127.0.0.1:9010/hello?pretty=1' -d'
{
  "settings":{
    "analysis":{
      "analyzer":{
        "camel":{
          "type":"pattern",
          "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
          "filters": ["edge_ngram"]
        }
      }
    }
  }
}
'
curl -XGET    'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer":"camel",
  "text":"JavaFileName"
}'

results in:

{
  "tokens" : [ {
    "token" : "java",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "file",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "name",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "word",
    "position" : 2
  } ]
}
– slartidan
  • You can only have a single tokenizer, either `pattern` or `edge_ngram` but not both at the same time. Besides, I'm not sure why the character case should make any difference. How different is it from searching `JaFNam` or `jafnam`? – Val Jan 18 '17 at 11:52
  • @Val eclipse and IntelliJ IDEs use that kind of "case interpretation". They interpret `JaFNam` and `jafnam` differently. I want to use the same behavior for my search. – slartidan Jan 18 '17 at 12:26
  • @Val Can I use a `edge_ngram` *filter* instead of an `edge_ngram` tokenizer to achieve the desired behaviour? – slartidan Jan 18 '17 at 12:28
  • yes there's also an `edge_ngram` filter, you can try – Val Jan 18 '17 at 12:36
  • @Val Sadly I get the same results when using a filter. I updated my question to use filter instead of tokenizer. – slartidan Jan 18 '17 at 12:54
  • Your analyzer definition is not correct. You need a `tokenizer` and an array of `filter`; as it is, your analyzer doesn't work. – Val Jan 18 '17 at 13:09

1 Answer


Your analyzer definition is not correct. You need a tokenizer and an array of filter; as it is, your analyzer doesn't work. Try it like this instead:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "tokenizer": "my_pattern",
          "filter": [
            "my_gram"
          ]
        }
      },
      "filter": {
        "my_gram": {
          "type": "edge_ngram",
          "max_gram": 10
        }
      },
      "tokenizer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}
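To check the result (a quick sketch, assuming the same index name hello and port 9010 as in the question), recreate the index with the settings above and re-run the _analyze request. With the max_gram of 10 configured above and the edge_ngram filter's default min_gram of 1, the camel analyzer should then emit prefix tokens for each word, roughly:

J Ja Jav Java F Fi Fil File N Na Nam Name

curl -XPUT    'http://127.0.0.1:9010/hello?pretty=1' -d'
{ ...the settings block above... }
'
curl -XGET    'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer":"camel",
  "text":"JavaFileName"
}'

If you also want the lowercased terms from your original output, you can add the built-in lowercase token filter before my_gram in the filter array, since the pattern tokenizer (unlike the pattern analyzer) does not lowercase by default.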
– Val