I have a string that I'd like to index as a keyword type, but with a special comma analyzer. For example,

"San Francisco, Boston, New York" -> "San Francisco", "Boston", "New York"

The field should be both indexed and aggregatable, so that I can split it up into buckets. Before 5.0.0 the following worked. Index settings:

{
     'settings': {
         'analysis': {
             'tokenizer': {
                 'comma': {
                     'type': 'pattern',
                     'pattern': ','
                 }
             },
             'analyzer': {
                 'comma': {
                     'type': 'custom',
                     'tokenizer': 'comma'
                 }
             }
         },
     },
}

with the following mapping:

{
    'city': {
        'type': 'string',
        'analyzer': 'comma'
    },
}

Now in 5.3.0 and above the analyzer is no longer a valid property for the keyword type, and my understanding is that I want a keyword type here. How do I specify an aggregatable, indexed, searchable text type with custom analyzer?

Oleksiy
  • In 5.2 and above, `keyword` fields can have [normalizers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html), but those only allow specific token filters and char filters and no tokenizer, so that's not an option here. Do you have any way to split that string on the client side before sending it to ES? – Val Apr 18 '17 at 03:44

2 Answers

You can use multi-fields to index the same field in two different ways: one for searching and the other for aggregations.

I also suggest adding the trim and lowercase filters to the analyzer, so that the tokens it produces are easier to search.
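
To see why the trim filter matters, here is a rough Python approximation of what the comma analyzer does to the input. It is only a sketch: `comma_analyze` is a made-up helper, not part of Elasticsearch, and it assumes the pattern tokenizer behaves like a plain split on "," for this simple input.

```python
import re

def comma_analyze(text):
    """Hypothetical stand-in for the 'comma' analyzer (not an ES API)."""
    tokens = re.split(",", text)          # pattern tokenizer, pattern ","
    tokens = [t.lower() for t in tokens]  # "lowercase" token filter
    tokens = [t.strip() for t in tokens]  # "trim" token filter
    return tokens

text = "San Francisco, Boston, New York"

# Tokenizer alone: note the leading spaces left on the 2nd and 3rd tokens.
raw_tokens = re.split(",", text)
print(raw_tokens)

# Tokenizer plus filters: clean, lowercased terms.
tokens = comma_analyze(text)
print(tokens)
```

Without trim, the indexed terms would be " Boston" and " New York" with a leading space, so a term query for "new york" would not match them.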

Mappings

PUT commaindex2
    {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "comma": {
                        "type": "pattern",
                        "pattern": ","
                    }
                },
                "analyzer": {
                    "comma": {
                        "type": "custom",
                        "tokenizer": "comma",
                        "filter": ["lowercase", "trim"]
                    }
                }
            }
        },
        "mappings": {
            "city_document": {
                "properties": {
                    "city": {
                        "type": "keyword",
                        "fields": {
                            "city_custom_analyzed": {
                                "type": "text",
                                "analyzer": "comma",
                                "fielddata": true
                            }
                        }
                    }
                }
            }
        }
    }

Index Document

POST commaindex2/city_document
{
  "city" : "san francisco, new york, london"
}

Search Query

POST commaindex2/city_document/_search
{
    "query": {
        "bool": {
            "must": [{
                "term": {
                    "city.city_custom_analyzed": {
                        "value": "new york"
                    }
                }
            }]
        }
    },
    "aggs": {
        "terms_agg": {
            "terms": {
                "field": "city",
                "size": 10
            }
        }
    }
}

Note

If you want the aggregation to work on the individual tokens, for example to count documents per city in separate buckets, run the terms aggregation on the city.city_custom_analyzed field instead of city.

POST commaindex2/city_document/_search
{
    "query": {
        "bool": {
            "must": [{
                "term": {
                    "city.city_custom_analyzed": {
                        "value": "new york"
                    }
                }
            }]
        }
    },
    "aggs": {
        "terms_agg": {
            "terms": {
                "field": "city.city_custom_analyzed",
                "size": 10
            }
        }
    }
}
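
The difference between the two aggregations can be illustrated with a small Python simulation. This is a sketch only: the two sample documents and the `analyze` helper are invented for illustration, and `Counter` just mimics how a terms aggregation counts documents per term.

```python
from collections import Counter

# Hypothetical sample values of the "city" field in two documents.
docs = [
    "san francisco, new york, london",
    "new york, boston",
]

def analyze(text):
    """Approximation of the comma analyzer: split on ',', trim, lowercase."""
    return [t.strip().lower() for t in text.split(",")]

# Terms aggregation on "city" (keyword): the whole string is one term.
keyword_buckets = Counter(docs)

# Terms aggregation on "city.city_custom_analyzed": one bucket per token,
# and each document is counted at most once per term.
analyzed_buckets = Counter(term for doc in docs for term in set(analyze(doc)))

print(keyword_buckets)
print(analyzed_buckets)
```

In this simulation the keyword field yields one bucket per full string, while the analyzed sub-field yields a "new york" bucket with a count of 2.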

Hope this helps

user3775217
  • What happens when you run the terms aggregation on the `city` field? You probably don't get each city separately, right? – Val Apr 18 '17 at 05:27
  • Yeah, then he can run it on the city.city_custom_analyzed field instead. He has not clearly mentioned where he wants to run the aggregations. If that's the case, then he doesn't need keyword at all; he can just apply the custom analyzer and avoid multi-fields. That's up to him, he can choose either field. I will update the post. Thanks – user3775217 Apr 18 '17 at 05:39

Since you're using ES 5.3, I suggest a different approach, using an ingest pipeline to split your field at indexing time.

PUT _ingest/pipeline/city-splitter
{
  "description": "City splitter",
  "processors": [
    {
      "split": {
        "field": "city",
        "separator": ","
      }
    },
    {
      "foreach": {
        "field": "city",
        "processor": {
          "trim": {
            "field": "_ingest._value"
          }
        }
      }
    }
  ]
}

Then you can index a new document:

PUT cities/city/1?pipeline=city-splitter
{ "city" : "San Francisco, Boston, New York" }
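
As a rough picture of what the pipeline does to that document before it is indexed, here is a Python sketch. The function `city_splitter` is a hypothetical imitation of the two processors, not an Elasticsearch API.

```python
def city_splitter(doc):
    """Hypothetical imitation of the city-splitter pipeline."""
    out = dict(doc)                                 # leave the input untouched
    out["city"] = out["city"].split(",")            # "split" processor
    out["city"] = [v.strip() for v in out["city"]]  # "foreach" + "trim"
    return out

result = city_splitter({"city": "San Francisco, Boston, New York"})
print(result)
```

The city field ends up as an array of three trimmed strings, so each city is indexed as a separate value.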

And finally you can search/sort on city and run an aggregation on the field city.keyword (the keyword sub-field that the default dynamic mapping adds to string fields in 5.x), as if the cities had been split in your client application:

POST cities/_search
{
  "query": {
     "match": {
         "city": "boston"
     }
  },
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.keyword"
      }
    }
  }
}
Val