0

I have documents with (id, value, modified_date). Need to get all the documents for ids which have a specific value as of the last modified_date.

My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, looks like, I would use "top_hits" with some post-filtering of the results.

The goal is to do as much work as possible on the server side to speed things up. Would've been trivial in SQL, but with ElasticSearch I am at a loss. And then I would need to write this in python using elasticsearch_dsl. Can anyone help?

UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the last modified_date" means: 1. group by id, 2. in each group select the record with the largest modified_date, 3. keep only those records that have the specific value, 4. from those records keep only ids, 5. get all documents where ids are in the list coming from 4.

Specifically, 1 is an aggregation, 2 is another aggregation using "top_hits" and reverse sorting by date, 3 is an analog of SQL's HAVING clause - Bucket Selector Aggregation (?), 4 _source, 5 terms-lookup.

My biggest challenge so far has been figuring out that Bucket Selector Aggregation is what I need and putting things together.

  • Can't help with the python part, but I'm not sure I understand the exact query. You need the documents with modified_date:, but what is your intention with the ids? – MrSimple Jan 15 '18 at 09:16
  • @MrSimple Let me be more specific. Suppose I have (product_id, price, modified_date). Basically, these are historic prices for different products. What I want is to get all the historic prices for products whose current (=last) price is less than $50. As I understand, first I need to get all product_id's, whose current price is less than $50; and, second, get all the historic records for those product_id's. – Dmitry Frumkin Jan 15 '18 at 13:10
  • I know this is very little lead, but as far as I know Elasticsearch doesn't do join querys. See this: https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html – MrSimple Jan 15 '18 at 13:20
  • @MrSimple The term lookup query mentioned there could work. But I would be happy with two separate queries as well. That is getting ids first, and then running the second query with those ids copied into it. – Dmitry Frumkin Jan 15 '18 at 13:30

1 Answers1

0

This shows an example on how to get the latest elements in each group:

How to get latest values for each group with an Elasticsearch query?

This will return the average price bucketed in days intervals:

GET /logstash-*/_search?size=0
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

I wrote it so it matches all record, that obviously returns more data than you need. Depending on the amount of data it might be easier to finish the task on client side.

MrSimple
  • 599
  • 4
  • 14
  • Yes, I've seen that reference already, but was confused how to proceed after having "top_hits". The rest of what you wrote - I don't know how it may be relevant to my question. – Dmitry Frumkin Jan 15 '18 at 14:05
  • You can extract the product_ids from the top hits. The second part is in fact irrelevant, I forgot that you need the average prices in a time series. I gave you the average price for products. Let me fix it. – MrSimple Jan 15 '18 at 14:15
  • I don't understand what you are answering. I've updated my question with how I see this should be handled. – Dmitry Frumkin Jan 15 '18 at 14:37
  • Sry, I have no SQL experience. In my suggestion the linked part is 1) and 2), you would do manually 3) and 4) on client side. And if you need all the documents without sorting them into a date histogram, then you don't need my second part, you send a query with those ids that you select manually on client side. – MrSimple Jan 15 '18 at 14:47
  • I've been doing everything on the client side in memory (pandas), but have been looking for ways to speed things up because reading from ElasticSearch can take more time than a user is willing to wait. – Dmitry Frumkin Jan 15 '18 at 15:12