I'm trying to partition a terms aggregation, similar to the example in the Elasticsearch documentation, but I can't get the example to work.
The index is populated with event-types:
public class Event
{
    public int EventId { get; set; }
    public string SegmentId { get; set; }
    public DateTime Timestamp { get; set; }
}
The EventId is unique, and each event belongs to a specific SegmentId. Each SegmentId can be associated with zero to many events.
The question is:
How do I get the latest EventId for each SegmentId?
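To pin down the expected result, here is a minimal Python sketch of the same computation done client-side on a handful of in-memory events. The Event dataclass mirrors the C# type above; the sample data and the latest_event_ids helper are made up for illustration and are not part of the index.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    event_id: int
    segment_id: str
    timestamp: datetime


def latest_event_ids(events):
    """Map each segment_id to the event_id of its most recent event."""
    latest = {}
    for e in events:
        current = latest.get(e.segment_id)
        # Keep the event with the greatest timestamp seen so far per segment.
        if current is None or e.timestamp > current.timestamp:
            latest[e.segment_id] = e
    return {seg: e.event_id for seg, e in latest.items()}


# Hypothetical sample data: two events in segment "a", one in segment "b".
events = [
    Event(1, "a", datetime(2020, 1, 1)),
    Event(2, "a", datetime(2020, 1, 3)),
    Event(3, "b", datetime(2020, 1, 2)),
]
print(latest_event_ids(events))  # {'a': 2, 'b': 3}
```

This is only a statement of the desired output; at the expected data volume the work has to happen inside Elasticsearch.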
I expect the number of unique segments to be on the order of 10 million, and the number of unique events to be one or two orders of magnitude greater. That's why I don't think using top_hits by itself is appropriate, as suggested here. Hence, partitioning.
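The idea behind partitioning can be sketched in Python as follows: each term is hashed and assigned to one of num_partitions groups, and a request for a given partition number only considers the terms in that group. This is a simplified model, not Elasticsearch's implementation: the CRC32 hash is a stand-in for whatever hash Elasticsearch uses internally, and the 101 "segment-N" IDs and the partition count of 7 are hypothetical values chosen for illustration.

```python
import zlib


def partition_of(term, num_partitions):
    # Stand-in hash (CRC32); Elasticsearch uses its own internal hash function.
    return zlib.crc32(term.encode("utf-8")) % num_partitions


# Hypothetical segment IDs, just to have something to hash.
segment_ids = [f"segment-{i}" for i in range(101)]
num_partitions = 7

# Assign every segment ID to exactly one partition.
partitions = {p: [] for p in range(num_partitions)}
for segment_id in segment_ids:
    partitions[partition_of(segment_id, num_partitions)].append(segment_id)

# Print how many segments landed in each partition.
for p, members in partitions.items():
    print(p, len(members))
```

Because hashing only spreads terms approximately evenly, some partitions can hold more than total / num_partitions terms, so the size requested per partition needs some headroom.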
Example:
I have set up a demo index populated with 1313 documents (unique EventId), belonging to 101 distinct SegmentId values (i.e. 13 events per segment). I would expect the query below to work, but the exact same results are returned regardless of which partition number I specify.
POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segments": {
      "terms": {
        "field": "segmentId",
        "size": 15, <-- I want 15 segments from each query
        "include": {
          "partition": 0, <-- Trying to retrieve the first partition
          "num_partitions": 7 <-- Expecting 7 partitions (7*15 > 101 segments)
        }
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": [
              "timestamp",
              "eventId",
              "segmentId"
            ],
            "sort": {
              "timestamp": "desc"
            }
          }
        }
      }
    }
  }
}
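Retrieving all segments would then mean sending this request once per partition number. A minimal Python sketch that only builds the request bodies is shown here; build_partition_query is a hypothetical helper name, and actually sending each body to /demo/_search is left out.

```python
import json


def build_partition_query(partition, num_partitions=7, size=15):
    """Return the search body for one partition of the segments aggregation."""
    return {
        "size": 0,
        "aggs": {
            "segments": {
                "terms": {
                    "field": "segmentId",
                    "size": size,
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                },
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "_source": ["timestamp", "eventId", "segmentId"],
                            "sort": {"timestamp": "desc"},
                        }
                    }
                },
            }
        },
    }


# One body per partition; each would be POSTed to /demo/_search in turn.
bodies = [build_partition_query(p) for p in range(7)]
print(json.dumps(bodies[0], indent=2))
```

Only the partition number changes between requests; num_partitions and size stay fixed for the whole sweep.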
If I remove the include and set size to a value greater than 101, I get the latest event for every segment. However, I doubt that is a good approach with millions of buckets.