
I wonder if someone could help.

I've got an Elasticsearch index whose mapping is broadly as below:

{
  "properties": {
    "content": {
      "type": "string"
    },
    "topics": {
      "properties": {
        "topic_type": {
          "type": "string"
        },
        "topic": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

So you end up with an entry in the index broadly along the lines of:

{
  "content": "some load of content",
  "timestamp": "some time stamp",
  "id": "some id",
  "topics": [
    {
      "topic": "safety",
      "topic_type": "Flight"
    },
    {
      "topic": "rockets",
      "topic_type": "Space"
    }
  ]
}

where each blob of content can have more than one topic associated with it.

What I'd like to be able to do is aggregate, by day, a count of each of the different "Space" topics, e.g.:

April 1st:

  • "rockets": 20
  • "astronauts": 2
  • "aliens": 5

April 2nd:

  • "rockets": 10
  • "astronauts": 12
  • "aliens": 51

and so on.

What I've tried to do is something like:

curl -X POST 'http://localhost:9200/myindex/_search?search_type=count&pretty=true' -d '{
  "size": "100000",
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "myindex.topics.topic_type": "space"
          }
        }
      ]
    }
  },
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "topics_over_time": {
          "terms": {
            "field": "topics.topic"
          }
        }
      }
    }
  }
}'

The problem is that although the query only matches articles that have at least one topic with a topic_type of "space", the terms aggregation in the "aggs" section then counts every "topics.topic" value on those articles, including topics whose topic_type is not "space".
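To illustrate why I think the extra topics leak in: as far as I understand it, an array of plain (non-nested) objects is indexed as one multi-valued field per key, so the pairing between each "topic" and its "topic_type" is lost. Below is a minimal Python sketch of that flattening. The helper name flatten_topics is hypothetical and only mimics the behaviour; it is not Elasticsearch code.

```python
def flatten_topics(doc):
    """Mimic how an array of plain objects ends up as one multi-valued field per key."""
    flat = {}
    for entry in doc["topics"]:
        for key, value in entry.items():
            flat.setdefault("topics." + key, []).append(value)
    return flat


# Sample document from above, reduced to its topics.
doc = {
    "topics": [
        {"topic": "safety", "topic_type": "Flight"},
        {"topic": "rockets", "topic_type": "Space"},
    ]
}

flat = flatten_topics(doc)
print(flat)
```

Once the document looks like this, nothing records that "rockets" (and not "safety") belongs to "Space", which would explain why the terms aggregation counts both.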

What I want to be able to say is "count and aggregate (essentially group by) only those topics that are of topic type 'space'".

So with just this in the index:

{
  "content": "some load of content",
  "timestamp": "some time stamp",
  "id": "some id",
  "topics": [
    {
      "topic": "safety",
      "topic_type": "Flight"
    },
    {
      "topic": "rockets",
      "topic_type": "Space"
    }
  ]
}

It would be: rockets: 1

With these two in the index:

{
  "content": "some load of content",
  "timestamp": "some time stamp",
  "id": "some id",
  "topics": [
    {
      "topic": "safety",
      "topic_type": "Flight"
    },
    {
      "topic": "rockets",
      "topic_type": "Space"
    }
  ]
}

{
  "content": "some load of content2",
  "timestamp": "some time stamp",
  "id": "some id",
  "topics": [
    {
      "topic": "safety",
      "topic_type": "Flight"
    },
    {
      "topic": "rockets",
      "topic_type": "Space"
    },
    {
      "topic": "aliens",
      "topic_type": "Space"
    }
  ]
}

It would be: rockets: 2, aliens: 1, but with the counts grouped by day.
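To pin down the result I'm after, here is a small Python sketch that computes it directly from the two sample documents. The timestamps are made-up ISO 8601 values (the real ones are elided above), and the function name space_topics_by_day is just for illustration.

```python
from collections import Counter, defaultdict


def space_topics_by_day(docs, topic_type="Space"):
    """Count topics of the given type per day, ignoring topics of other types."""
    counts = defaultdict(Counter)
    for doc in docs:
        day = doc["timestamp"][:10]  # assumes ISO 8601 timestamps (hypothetical)
        for entry in doc["topics"]:
            if entry["topic_type"] == topic_type:
                counts[day][entry["topic"]] += 1
    return {day: dict(per_day) for day, per_day in counts.items()}


docs = [
    {
        "timestamp": "2014-04-01T10:00:00",  # made-up value
        "topics": [
            {"topic": "safety", "topic_type": "Flight"},
            {"topic": "rockets", "topic_type": "Space"},
        ],
    },
    {
        "timestamp": "2014-04-01T15:30:00",  # made-up value
        "topics": [
            {"topic": "safety", "topic_type": "Flight"},
            {"topic": "rockets", "topic_type": "Space"},
            {"topic": "aliens", "topic_type": "Space"},
        ],
    },
]

result = space_topics_by_day(docs)
print(result)
```

Note that "safety" never appears in the result, because its topic_type is "Flight".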

I'm not sure how to do this with Elasticsearch.

If the index schema is not fit for purpose here, please do let me know what would be (in your opinion).
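One direction that might fit, though I have not verified it against my data, is to map "topics" as a nested type and then wrap the terms aggregation in a nested aggregation plus a filter aggregation. The sketch below builds the mapping and the search body as Python dictionaries so they can be dumped to JSON and posted with curl. The aggregation names "topics", "space_only" and the variable names are placeholders I made up; the lowercase "space" assumes topic_type is still analyzed with the default analyzer.

```python
import json

# Hypothetical mapping: "topics" is declared as nested so that each
# topic/topic_type pair is kept together as its own hidden sub-document.
nested_mapping = {
    "properties": {
        "content": {"type": "string"},
        "topics": {
            "type": "nested",
            "properties": {
                "topic_type": {"type": "string"},
                "topic": {"type": "string", "index": "not_analyzed"},
            },
        },
    }
}

# Per day: step into the nested topics, keep only those of type "space",
# then count the remaining topic values.
search_body = {
    "aggs": {
        "articles_over_time": {
            "date_histogram": {"field": "timestamp", "interval": "day"},
            "aggs": {
                "topics": {
                    "nested": {"path": "topics"},
                    "aggs": {
                        "space_only": {
                            "filter": {"term": {"topics.topic_type": "space"}},
                            "aggs": {
                                "topics_over_time": {
                                    "terms": {"field": "topics.topic"}
                                }
                            },
                        }
                    },
                }
            },
        }
    }
}

print(json.dumps(search_body, indent=2))
```

If this is the right idea, the existing documents would presumably need to be reindexed under the new mapping before the aggregation returns anything useful.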

Saeed Zhiany
Roland Dunn
  • Can you include your settings, if you changed anything? And to make sure I understand the expected results: you're expecting aggregations on terms for topics.topic, restricted to a given topic_type? – Michael at qbox.io Apr 24 '14 at 20:16
  • Settings are all default, and expected results are as you state. Aggregate by day, so per day, count of each topic per topic type. – Roland Dunn Apr 27 '14 at 10:50
