
I have a collection of documents that belong to a few authors:

[
  { id: 1, author_id: 'mark', content: [...] },
  { id: 2, author_id: 'pierre', content: [...] },
  { id: 3, author_id: 'pierre', content: [...] },
  { id: 4, author_id: 'mark', content: [...] },
  { id: 5, author_id: 'william', content: [...] },
  ...
]

I'd like to retrieve and paginate a distinct selection of the best-matching documents, one per author:

[
  { id: 1, author_id: 'mark', content: [...], _score: 100 },
  { id: 3, author_id: 'pierre', content: [...], _score: 90 },
  { id: 5, author_id: 'william', content: [...], _score: 80 },
  ...
]

Here's what I'm currently doing (pseudo-code):

unique_docs = res.results.to_a.uniq { |doc| doc.author_id }

The problem is pagination: how do I select 20 "distinct" documents?
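To make the problem concrete, here's a runnable in-memory sketch of the naive dedupe-then-slice approach (the documents and scores below are made up for illustration):

```ruby
# Naive approach: fetch results, dedupe by author in memory, then slice
# pages from the deduped list. This only works if every candidate doc
# is already in memory, which is exactly what breaks once you try to
# paginate with from/size on the server side.
docs = [
  { id: 1, author_id: 'mark',    score: 100 },
  { id: 2, author_id: 'pierre',  score: 95 },
  { id: 3, author_id: 'pierre',  score: 90 },
  { id: 4, author_id: 'mark',    score: 85 },
  { id: 5, author_id: 'william', score: 80 },
]

# Enumerable#uniq keeps the first (highest-scoring) doc per author.
unique_docs = docs.uniq { |doc| doc[:author_id] }

page_size = 2
page_1 = unique_docs.first(page_size)
```

Note that the number of unique docs per fetched batch is unpredictable, so a server-side `from`/`size` offset no longer maps cleanly onto page boundaries.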

Some people have suggested term facets, but I'm not building a tag cloud.

Thanks,
Adit

  • What are the scores in the results? – ramseykhalaf Jul 31 '13 at 04:56
  • Term facet does this very well. You should try it. – shyos Jul 31 '13 at 07:12
  • Hi @shyos, term facets tell me there are some unique documents, but: 1. they don't show how those documents score among the others; 2. I don't think it's possible to paginate them (e.g. show 20 docs, skipping the first 300 distinct results); 3. they don't allow highlighting and all the other benefits. – Adit Saxena Aug 01 '13 at 13:29

2 Answers


Since ElasticSearch does not currently provide a group_by equivalent, here's my attempt to do it manually.
While the ES community works on a direct solution to this problem (probably a plugin), here's a basic approach that covers my needs.

Assumptions

  1. I'm looking for relevant content.

  2. I've assumed that the first 300 docs are relevant, so I restrict my search to this selection, regardless of how many of them come from the same few authors.

  3. For my needs I didn't really need full pagination; a "show more" button updated through ajax was enough.

Drawbacks

  1. results are not precise:
    since we take 300 docs at a time, we don't know how many unique docs will come out (it could even be 300 docs from the same author!). You should check whether this fits your average number of docs per author, and possibly consider a limit.

  2. you need to run 2 queries (paying the cost of two remote calls):

    • the first query asks for the 300 most relevant docs, returning only two fields: id & author_id
    • a second query retrieves the full documents for the paginated ids

Here's some ruby pseudo-code: https://gist.github.com/saxxi/6495116
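The two-query flow above can be sketched like this. `search_ids` and `fetch_docs` are hypothetical helpers standing in for your ES client calls (e.g. via the elasticsearch-ruby gem); here they are stubbed with in-memory data so the flow is runnable end to end:

```ruby
# Stand-in for the index: what the first (fields-only) query would scan.
CANDIDATES = [
  { id: 1, author_id: 'mark' },
  { id: 2, author_id: 'pierre' },
  { id: 3, author_id: 'pierre' },
  { id: 4, author_id: 'mark' },
  { id: 5, author_id: 'william' },
].freeze

# Query 1 (stub): top `limit` relevant docs, only id & author_id fields.
def search_ids(limit = 300)
  CANDIDATES.first(limit)
end

# Query 2 (stub): full documents for the paginated ids.
def fetch_docs(ids)
  ids.map { |id| { id: id, content: "full content of doc #{id}" } }
end

# Dedupe by author, slice the requested page of ids, then fetch full docs.
def distinct_page(page, per_page)
  unique = search_ids.uniq { |doc| doc[:author_id] }
  ids = unique.map { |doc| doc[:id] }.drop(page * per_page).first(per_page)
  fetch_docs(ids)
end
```

In a real client the stubs would be replaced by two actual search calls; the pagination logic in `distinct_page` stays the same.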

Adit Saxena

The 'group_by' issue has now been addressed: you can use aggregations, available since Elasticsearch 1.3.0 (#6124).

If you run the following query,

{
    "aggs": {
        "user_count": {
            "terms": {
                "field": "author_id",
                "size": 0
            }
        }
    }
}

you will get a result like this:

{
  "took" : 123,
  "timed_out" : false,
  "_shards" : { ... },
  "hits" : { ... },
  "aggregations" : {
    "user_count" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "mark",
        "doc_count" : 87350
      }, {
        "key" : "pierre",
        "doc_count" : 41809
      }, {
        "key" : "william",
        "doc_count" : 24476
      } ]
    }
  }
}
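On the client side, the per-author buckets can be pulled out of that response with a few lines of Ruby (the JSON below is trimmed to the `aggregations` part shown above):

```ruby
require 'json'

# Trimmed aggregation response, as returned by the query above.
response = <<~JSON
  {
    "aggregations": {
      "user_count": {
        "buckets": [
          { "key": "mark",    "doc_count": 87350 },
          { "key": "pierre",  "doc_count": 41809 },
          { "key": "william", "doc_count": 24476 }
        ]
      }
    }
  }
JSON

# Map each author key to its document count.
buckets = JSON.parse(response).dig('aggregations', 'user_count', 'buckets')
author_counts = buckets.map { |b| [b['key'], b['doc_count']] }.to_h
```

Note this gives you one bucket per author with a count, not the best-scoring document per author, so on its own it answers the grouping half of the question but not the "top document per group with pagination" half.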
Miae Kim