9

As far as I'm aware, there isn't a way to do something like the following in Elasticsearch:

SELECT * FROM myindex
GROUP BY agg_field1, agg_field2, agg_field3 // aggregation
ORDER BY order_field1, order_field2, order_field3 // sort
LIMIT 1000, 5000 // paginate -- get page 6 of size 1000 records

Here are some related documents regarding this:

Is there a way to do the above in Elasticsearch? The one limitation we have is we will never have more than 10M records, so we (hopefully) shouldn't run into memory errors. My thinking was to do it as follows:

  • Do an aggregation query
  • Get the number of results from it
  • Split it into N segments based on the results and page size we want
  • Rerun the query with the above segments

What would be the best way to accomplish this? In your answer/suggestion, could you please post some sample code relating to how the above SQL query could be done in ES?


As an update to this question, here is a public index to test with:

# 5.6
e=Elasticsearch('https://search-testinges-fekocjpedql2f3rneuagyukvy4.us-west-1.es.amazonaws.com')
e.search('testindex')

# 6.4 (same data as above)
e = Elasticsearch('https://search-testinges6-fycj5kjd7l5uyo6npycuashch4.us-west-1.es.amazonaws.com')
e.search('testindex6')

It has 10,000 records. Feel free to test with it:

enter image description here

The query that I'm looking to do is as follows (in sql):

SELECT * FROM testindex
GROUP BY store_url, status, title
ORDER BY title ASC, status DESC
LIMIT 100 OFFSET 6000

In other words, I'm looking to sort an aggregated result (with multiple aggregations) and get an offset.

  • As of ES 6.1 there's a new aggregation called [composite](https://www.elastic.co/guide/en/elasticsearch/reference/6.6/search-aggregations-bucket-composite-aggregation.html) that might help you achieve this, but since you're running 5.6.8, it's not there, unfortunately. Moreover, grouping by title won't be possible as `title` is of type `text` and not `keyword`. – Val Feb 21 '19 at 04:56
  • Moreover, I don't see how it makes sense to sort by other fields than the ones you group by. I mean, just in the `store_url` group you're going to have several different `retailer_id` values and the one showing up might not necessarily represent all same stores. Or am I missing something obvious here? – Val Feb 21 '19 at 05:15
  • @Val -- thanks for the feedback, I've updated the index so that's a keyword type. I'll change the sort so it makes more sense. –  Feb 21 '19 at 05:27
  • @Val I've also created a new index with the same data that's ES 6.4 if you want to try it out (see updated question). –  Feb 21 '19 at 05:40
  • Thanks, but there are no docs in testindex6. I've reindexed testindex into testindex6 – Val Feb 21 '19 at 05:52

2 Answers2

4

The composite aggregation might help here as it allows you to group by multiple fields and then paginate over the results. The only thing that it doesn't let you do is to jump at a given offset, but you can do that by iterating from your client code if at all necessary.

So here is a sample query to do that:

POST testindex6/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

In the response you'll see and after_key structure:

  "after_key": {
    "store": "http://google.com1087",
    "status": "OK1087",
    "title": "Titanic1087"
  },

It's some kind of cursor that you need to use in your subsequent queries, like this:

{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ],
        "after": {
          "store": "http://google.com1087",
          "status": "OK1087",
          "title": "Titanic1087"
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

And it will give you the next 100 buckets. Hopefully this helps.

UPDATE:

If you want to know how many buckets in total there is going to be, the composite aggregation won't give you that number. However, since the composite aggregation is nothing else than a cartesian product of all the fields in its sources, you can get a good approximation of that total number by also returning the ]cardinality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) of each field used in the composite aggregation and multiplying them together.

  "aggs": {
    "my_buckets": {
      "composite": {
        ...
      }
    },
    "store_cardinality": {
      "cardinality": {
        "field": "store_url"
      }
    },
    "status_cardinality": {
      "cardinality": {
        "field": "status"
      }
    },
    "title_cardinality": {
      "cardinality": {
        "field": "title"
      }
    }
  }

We can then get the total number of buckets by multiplying the figure we get in store_cardinality, status_cardinality and title_cardinality together, or at least a good approximation thereof (it won't work well on high-cardinality fields, but pretty well on low-cardinality ones).

Val
  • 207,596
  • 13
  • 358
  • 360
  • Is it possible to get total result count when above composite-aggregate query with multiple field is used for pagination? – Bhavesh Mar 04 '19 at 13:32
  • What is the performance impact for high cardinality fields? Does Elastic repeat the operation under the hood even if we use the `after` token? Thanks so much! – animageofmine Jul 22 '20 at 06:38
  • the question included the sorting part, and the answer doesn't include a solution for it, anybody got the sorting working? – TValerii Feb 09 '21 at 10:22
3

Field collapsing is the answer.

Field collapsing feature is used when we want to group the hits on a specific field (as in group by agg_field).

Before Elastic 6, the way to group the fields is to use aggregation. This approach was lacking an ability to do efficient paging.

But now, with the field collapse provided out of the box by elastic, it is pretty easy.

Below is a sample query with field collapse taken from above link.

GET /twitter/_search
{
  "query": {
      "match": {
          "message": "elasticsearch"
      }
  },
  "collapse" : {
      "field" : "user", 
      "inner_hits": {
          "name": "last_tweets", 
          "size": 5, 
          "sort": [{ "date": "asc" }] 
      },
      "max_concurrent_group_searches": 4 
  },
  "sort": ["likes"]

}

Shailesh Pratapwar
  • 4,054
  • 3
  • 35
  • 46
  • what if you want to group by multiple fields? e.g. GROUP BY field1, field2, field3 ... so on. ES supports deep nested aggregation but field collapsing doesn't seem to be supporting it. – Bhavesh Feb 19 '19 at 10:55
  • Yes. Field collapsing will not cater to that. But it's better for what it is. – Shailesh Pratapwar Feb 19 '19 at 13:20
  • @ShaileshPratapwar could you please see the updated question? This has to do with sorting while having an aggregation across multiple fields. –  Feb 20 '19 at 06:12
  • @ShaileshPratapwar I've added a test index and the query I'm looking to use. I don't think you answer will solve it but please take a look! –  Feb 21 '19 at 01:11
  • it helped me a lot. Thanks. For a single field group by pagination for elastic search. – Muthukrishna C Apr 26 '20 at 09:47
  • Be aware of pagination limitation on ES side ([max 10,000 hits](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/paginate-search-results.html)). If you need to page through more than 10,000 hits, you should use the `search_after` parameter instead. Unfortunately, `collapse` cannot be used in conjunction with `search_after` ([link](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/collapse-search-results.html)). – lu_ko Feb 24 '21 at 16:17