4

I'm trying to run what started to look like a simple query on Elasticsearch, but I just can't seem to get the result I'm looking for.

Here's a brief example of what I'm trying to do:

I have a database of news. Each piece of news contains a source, a headline, a timestamp and a user.

I want the get the last (timestamp based) headline for each available source for a given user.

#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes

curl -XPUT "$ELASTICSEARCH_ENDPOINT/news" -d '{
    "mappings": {
        "news": {
            "properties": {
                "source": { "type": "string", "index": "not_analyzed" },
                "headline": { "type": "object" },
                "timestamp": { "type": "date", "format": "date_hour_minute_second_millis" },
                "user": { "type": "string", "index": "not_analyzed" }
            }
        }
    }
}'

# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "CNN", "headline": "Great news", "timestamp": "2015-07-28T00:07:29.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "CNN", "headline": "More great news", "timestamp": "2015-07-28T00:08:23.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "ESPN", "headline": "Sports news", "timestamp": "2015-07-28T00:09:32.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "ESPN", "headline": "More sports news", "timestamp": "2015-07-28T00:10:35.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "Mary", "source": "Yahoo", "headline": "More news", "timestamp": "2015-07-28T00:11:54.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "Mary", "source": "Yahoo", "headline": "Crazy news", "timestamp": "2015-07-28T00:12:31.000"}
'

So how do I get the last CNN and last ESPN headlines from John for example?

I've been looking into the multi search API, but this would mean that I would need to know all the sources beforehand (in this case CNN and ESPN).

Gonçalo Cabrita
  • 353
  • 1
  • 5
  • 11

1 Answers1

9

First, please note that I had to change your mapping for the headline field to string, as in your sample documents headlines are strings and not objects.

So, a query like the following one would retrieve what you expect:

curl -XPOST "$ELASTICSEARCH_ENDPOINT/news/_search" -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "user": "John"           <--- filter for user=John
        }
      }
    }
  },
  "aggs": {
    "sources": {
      "terms": {
        "field": "source"          <--- aggregate by source
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,             <--- only take the first...
            "_source": [           <--- only the date and headline
               "headline",
               "timestamp"
            ],
            "sort": {
              "timestamp": "desc"  <--- ...and only the latest hit
            }
          }
        }
      }
    }
  }
}'

That will yield something like this:

{
  ...
  "aggregations" : {
    "sources" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "CNN",
        "doc_count" : 2,
        "latest" : {
          "hits" : {
            "total" : 2,
            "max_score" : null,
            "hits" : [ {
              "_index" : "news",
              "_type" : "news",
              "_id" : "AU7Sh3VDGDddn2ZNuDVl",
              "_score" : null,
              "_source":{
                  "headline": "More great news", 
                  "timestamp": "2015-07-28T00:08:23.000"
              },
              "sort" : [ 1438042103000 ]
            } ]
          }
        }
      }, {
        "key" : "ESPN",
        "doc_count" : 2,
        "latest" : {
          "hits" : {
            "total" : 2,
            "max_score" : null,
            "hits" : [ {
              "_index" : "news",
              "_type" : "news",
              "_id" : "AU7Sh3VDGDddn2ZNuDVn",
              "_score" : null,
              "_source":{
                   "headline": "More sports news", 
                   "timestamp": "2015-07-28T00:10:35.000"
              },
              "sort" : [ 1438042235000 ]
            } ]
          }
        }
      } ]
    }
  }
}
Val
  • 207,596
  • 13
  • 358
  • 360
  • 1
    Thx so much! That's exactly what I was looking for! I had no idea you could use aggregations to extract the actual _source, I though it was just for statistics. – Gonçalo Cabrita Jul 28 '15 at 17:12
  • @Val I have a problem similar to above, but I want one more level of aggregation on top of what is achieved here. Looking forward for your response. http://stackoverflow.com/questions/38195420/elasticsearch-aggregation-over-top-hits – SuperCoder Jul 12 '16 at 10:37
  • Would it be possible to aggregate by multiple fields in the same manner? – SDekov Dec 01 '16 at 14:22
  • @StoyanDekov "Aggregate by multiple fields" can mean different things so I suggest you create a new question referencing this one and clearly explaining your need, so you increase your chances of getting an answer. – Val Dec 01 '16 at 14:25
  • Thank you. [Elasticsearch get the latest document, grouped by multiple fields](http://stackoverflow.com/q/40913326/5020253) raised. – SDekov Dec 01 '16 at 14:37
  • i really liek how this query gets to 5 levels deep into {} – pudiva Jan 08 '17 at 09:57