
I have a collection of documents that belong to a few authors:

[
  { id: 1, author_id: 'mark', content: [...] },
  { id: 2, author_id: 'pierre', content: [...] },
  { id: 3, author_id: 'pierre', content: [...] },
  { id: 4, author_id: 'mark', content: [...] },
  { id: 5, author_id: 'william', content: [...] },
  ...
]

I'd like to retrieve and paginate a distinct selection of the best-matching documents, one per author:

[
  { id: 1, author_id: 'mark', content: [...], _score: 100 },
  { id: 3, author_id: 'pierre', content: [...], _score: 90 },
  { id: 5, author_id: 'william', content: [...], _score: 80 },
  ...
]

Here's what I'm currently doing (pseudo-code):

unique_docs = res.results.to_a.uniq { |doc| doc.author_id }

The problem is pagination: how do I select 20 "distinct" documents?
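To make the problem concrete, here's a runnable in-memory sketch of the naive dedupe-then-slice approach (the documents and scores below are made up for illustration):

```ruby
# Naive approach: fetch results, dedupe by author in memory, then slice
# pages from the deduped list. This only works if every candidate doc
# is already in memory, which is exactly what breaks once you try to
# paginate with from/size on the server side.
docs = [
  { id: 1, author_id: 'mark',    score: 100 },
  { id: 2, author_id: 'pierre',  score: 95 },
  { id: 3, author_id: 'pierre',  score: 90 },
  { id: 4, author_id: 'mark',    score: 85 },
  { id: 5, author_id: 'william', score: 80 },
]

# Enumerable#uniq keeps the first (highest-scoring) doc per author.
unique_docs = docs.uniq { |doc| doc[:author_id] }

page_size = 2
page_1 = unique_docs.first(page_size)
```

Note that the number of unique docs per fetched batch is unpredictable, so a server-side `from`/`size` offset no longer maps cleanly onto page boundaries.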

Some people have suggested term facets, but I'm not building a tag cloud.

Thanks,
Adit

  • What are the scores in the results? – ramseykhalaf Jul 31 '13 at 04:56
  • Term facet does this very well. You should try it. – shyos Jul 31 '13 at 07:12
  • Hi @shyos, term facets tell me there are some unique documents, but: 1. they don't show how those documents score among the others; 2. I don't think it's possible to paginate them (e.g. show 20 docs, skipping the first 300 distinct results); 3. they don't allow highlighting and all the other benefits. – Adit Saxena Aug 01 '13 at 13:29

2 Answers


Since ElasticSearch does not currently provide a group_by equivalent, here's my attempt to do it manually.
While the ES community works on a direct solution to this problem (probably a plugin), here's a basic approach that covers my needs.

Assumptions

  1. I'm looking for relevant content.

  2. I've assumed that the first 300 docs are relevant, so I restrict my search to this selection, regardless of how many of them come from the same few authors.

  3. For my needs I didn't really need full pagination; a "show more" button updated through ajax was enough.

Drawbacks

  1. results are not precise:
    since we take 300 docs at a time, we don't know how many unique docs will come out (it could even be 300 docs from the same author!). You should check whether this fits your average number of docs per author, and possibly consider a limit.

  2. you need to run 2 queries (paying the cost of two remote calls):

    • the first query asks for the 300 most relevant docs, returning only two fields: id & author_id
    • a second query retrieves the full documents for the paginated ids

Here's some ruby pseudo-code: https://gist.github.com/saxxi/6495116
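The two-query flow above can be sketched like this. `search_ids` and `fetch_docs` are hypothetical helpers standing in for your ES client calls (e.g. via the elasticsearch-ruby gem); here they are stubbed with in-memory data so the flow is runnable end to end:

```ruby
# Stand-in for the index: what the first (fields-only) query would scan.
CANDIDATES = [
  { id: 1, author_id: 'mark' },
  { id: 2, author_id: 'pierre' },
  { id: 3, author_id: 'pierre' },
  { id: 4, author_id: 'mark' },
  { id: 5, author_id: 'william' },
].freeze

# Query 1 (stub): top `limit` relevant docs, only id & author_id fields.
def search_ids(limit = 300)
  CANDIDATES.first(limit)
end

# Query 2 (stub): full documents for the paginated ids.
def fetch_docs(ids)
  ids.map { |id| { id: id, content: "full content of doc #{id}" } }
end

# Dedupe by author, slice the requested page of ids, then fetch full docs.
def distinct_page(page, per_page)
  unique = search_ids.uniq { |doc| doc[:author_id] }
  ids = unique.map { |doc| doc[:id] }.drop(page * per_page).first(per_page)
  fetch_docs(ids)
end
```

In a real client the stubs would be replaced by two actual search calls; the pagination logic in `distinct_page` stays the same.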

Adit Saxena

The 'group_by' issue has now been addressed: you can use aggregations, available since Elasticsearch 1.3.0 (#6124).

If you run the following query,

{
    "aggs": {
        "user_count": {
            "terms": {
                "field": "author_id",
                "size": 0
            }
        }
    }
}

you will get a result like this:

{
  "took" : 123,
  "timed_out" : false,
  "_shards" : { ... },
  "hits" : { ... },
  "aggregations" : {
    "user_count" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "mark",
        "doc_count" : 87350
      }, {
        "key" : "pierre",
        "doc_count" : 41809
      }, {
        "key" : "william",
        "doc_count" : 24476
      } ]
    }
  }
}
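On the client side, the per-author buckets can be pulled out of that response with a few lines of Ruby (the JSON below is trimmed to the `aggregations` part shown above):

```ruby
require 'json'

# Trimmed aggregation response, as returned by the query above.
response = <<~JSON
  {
    "aggregations": {
      "user_count": {
        "buckets": [
          { "key": "mark",    "doc_count": 87350 },
          { "key": "pierre",  "doc_count": 41809 },
          { "key": "william", "doc_count": 24476 }
        ]
      }
    }
  }
JSON

# Map each author key to its document count.
buckets = JSON.parse(response).dig('aggregations', 'user_count', 'buckets')
author_counts = buckets.map { |b| [b['key'], b['doc_count']] }.to_h
```

Note this gives you one bucket per author with a count, not the best-scoring document per author, so on its own it answers the grouping half of the question but not the "top document per group with pagination" half.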
Miae Kim