
Say each document in my Elasticsearch index is a blog post with only two fields, title and tags. The title field is a string, while tags is a multi-valued field.

If I have three documents like this:

title      tags
"blog1"    [A,B,C]
"blog2"    [A,B]
"blog3"    [B,C]

I would like to bucket by the unique values of all possible tags, but how can I get results like those below, which contain three items in a bucket? Or are there any efficient alternatives?

{A: ["blog1", "blog2"]}
{B: ["blog1", "blog2", "blog3"]}
{C: ["blog1", "blog3"]}

It would be nice if someone could provide an answer using the Elasticsearch Python API.

1 Answer


You can use a terms aggregation on the tags field with a top_hits sub-aggregation. The following query returns the expected results, provided tags is mapped as keyword (a terms aggregation does not work on an analyzed text field by default).

{
    "size": 0,
    "aggs": {
        "tags": {
            "terms": { 
                "field": "tags" 
            },
            "aggs": {
                "top_titles": {
                    "top_hits": {
                        "_source": ["title"]
                    }
                }
            }
        }
    }
}
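Both aggregations truncate their output unless told otherwise: terms returns the 10 most frequent keys by default, and top_hits returns 3 hits per bucket by default. The three-document example fits within those limits, but a larger index would not. Here is a minimal sketch of a helper that sets both limits explicitly; the function name, its parameters, and the default of 100 are hypothetical choices for illustration, not part of any Elasticsearch API.

```python
import json


def build_tag_query(field="tags", max_tags=100, titles_per_tag=100):
    """Build the same aggregation body with explicit size limits.

    max_tags       -> "size" of the terms aggregation (number of buckets)
    titles_per_tag -> "size" of the top_hits sub-aggregation (hits per bucket)
    """
    return {
        "size": 0,  # skip the regular search hits, only aggregations are needed
        "aggs": {
            "tags": {
                "terms": {"field": field, "size": max_tags},
                "aggs": {
                    "top_titles": {
                        "top_hits": {
                            "size": titles_per_tag,
                            "_source": ["title"],
                        }
                    }
                },
            }
        },
    }


query = build_tag_query()
print(json.dumps(query, indent=2))
```

The resulting dict can be passed to the search call in place of the literal body.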

Using this with Python is straightforward:

from elasticsearch import Elasticsearch

# Assumes a cluster reachable at this address; adjust the URL for your setup
client = Elasticsearch("http://localhost:9200")

response = client.search(
    index="my-index",
    body={
        "size": 0,
        "aggs": {
            "tags": {
                "terms": {"field": "tags"},
                "aggs": {
                    "top_titles": {
                        "top_hits": {"_source": ["title"]}
                    }
                }
            }
        }
    }
)

# iterate over the tag buckets
for bucket in response['aggregations']['tags']['buckets']:
    tag = bucket['key']  # => A, B, C
    # read the titles from the bucket itself, not from the key string
    for hit in bucket['top_titles']['hits']['hits']:
        title = hit['_source']['title']  # => blog1, blog2, ...
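To get exactly the {tag: [titles]} shape from the question, the buckets can be folded into a dict. This is a minimal sketch, assuming the response has the structure produced by the query above. The function name `titles_by_tag` is hypothetical, and `sample_response` is hand-written to mirror the three example documents rather than captured from a real cluster.

```python
def titles_by_tag(response):
    """Map each tag key to the list of titles in its top_hits bucket."""
    result = {}
    for bucket in response["aggregations"]["tags"]["buckets"]:
        hits = bucket["top_titles"]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"]["title"] for hit in hits]
    return result


def _bucket(key, titles):
    # Helper that builds one hand-written bucket in the expected shape
    return {
        "key": key,
        "doc_count": len(titles),
        "top_titles": {"hits": {"hits": [{"_source": {"title": t}} for t in titles]}},
    }


# Hand-written stand-in for a real search response (assumption, not real output)
sample_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                _bucket("B", ["blog1", "blog2", "blog3"]),
                _bucket("A", ["blog1", "blog2"]),
                _bucket("C", ["blog1", "blog3"]),
            ]
        }
    }
}

print(titles_by_tag(sample_response))
```

With a live cluster, passing the `response` returned by `client.search` in place of `sample_response` should yield the same kind of mapping, limited by the size settings of the two aggregations.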
Val