Two indices are created, each holding one of the dates.

First index mapping:

PUT /index_one
{
    "mappings": {
        "properties": {
            "date_start": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}

Second index mapping:

PUT /index_two
{
    "mappings": {
        "properties": {
            "date_end": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}

I need to find documents within a certain date range and compute, as an aggregation, the average difference between the two dates.

I tried a request like this:

GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
    "aggs": {
        "filtered_dates": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "exists": {
                                "field": "date_start"
                            }
                        },
                        {
                            "exists": {
                                "field": "date_end"
                            }
                        }
                    ]
                }
            },
            "aggs": {
                "avg_date": {
                    "avg": {
                        "script": {
                            "lang": "painless",
                            "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
                        }
                    }
                }
            }
        }
    }
}

I get the following response to the request:

{
    "hits": {
        "total": {
            "value": 16508
        },
        "hits": [
            {
                "_index": "index_one",
                "_type": "_doc",
                "_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
                "_score": 1.0,
                "_source": {
                    "date_begin": "2021-02-26 07:26:29.732+0300"
                }
            }
        ]
    },
    "aggregations": {
        "filtered_dates": {
            "meta": {},
            "doc_count": 0,
            "avg_date": {
                "value": null
            }
        }
    }
}

Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?

fedor-sg

1 Answer

If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.


Now, the script context operates on a single document at a time and has "no clue" about other, potentially related docs. So unless you store both dates on at least one doc, you'll need some other way to connect the docs.

One option would be to use their ids:

POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }

POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }

POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }

POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }

After that, it's possible to:

  1. Target multiple indices — as you already do.
  2. Group the docs by their IDs and select only those that include at least 2 buckets (assuming two buckets represent the start & the end).
  3. Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
  4. Use a bucket_script aggregation to calculate their difference (in milliseconds).
  5. Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.

In concrete terms:

GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "min_date": {
          "min": {
            "field": "date_start"
          }
        },
        "max_date": {
          "max": {
            "field": "date_end"
          }
        },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "min": "min_date",
              "max": "max_date"
            },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_across_the_board": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}

If everything goes right, you'll end up with:

...
"aggregations" : {
  "grouped_by_id" : {
    ...
  },
  "avg_duration_across_the_board" : {
    "value" : 1.70208E10        <-- 17,020,800,000 milliseconds ~ 4,728 hrs
  }
}
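As a sanity check, the two id-groups indexed above produce exactly this value. Here is the same arithmetic in plain Python (purely to verify the numbers, not part of the Elasticsearch flow):

```python
from datetime import date

# The two id-groups indexed above: (date_start, date_end) per id
pairs = [
    (date(2021, 1, 1), date(2021, 12, 31)),  # id 1
    (date(2021, 1, 1), date(2021, 1, 31)),   # id 2
]

# Per-pair difference in milliseconds, mirroring the bucket_script
diffs_ms = [(end - start).days * 86_400_000 for start, end in pairs]

# The avg_bucket pipeline then averages the per-id differences
avg_ms = sum(diffs_ms) / len(diffs_ms)
print(avg_ms)  # 17020800000.0 — i.e. 1.70208E10 ms, as in the response above
```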

⚠️ Caveat: note that the 2nd level terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
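For instance, the terms block could be tuned like so (10000 is purely illustrative — pick a value based on how many distinct ids you actually expect):

```json
"grouped_by_id": {
  "terms": {
    "field": "id",
    "min_doc_count": 2,
    "size": 10000
  }
}
```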


Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook, which contains lots of other real-world, non-trivial examples.

Joe - GMapsBook.com
  • Thanks for the detailed answer. This ID appears in two indexes, but has a different name. Is it possible in this case to group documents? – fedor-sg Mar 24 '21 at 11:27
  • Yes — use whatever Id field you got and let me know how it went. – Joe - GMapsBook.com Mar 24 '21 at 11:49
  • Clarification - the two indexes have different field names. For instance, "uuid", "guid" – fedor-sg Mar 24 '21 at 11:53
  • Oh I see — terms agg supports a script too so you can target either of those. – Joe - GMapsBook.com Mar 24 '21 at 12:27
  • In my example, it does not work, because the field is not integer, but text. An error occurs: "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [gid] in order to load field data by uninverting the inverted index. Note that this can use significant memory." – fedor-sg Mar 29 '21 at 11:06
  • Add a keyword multifield like here -- https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#fielddata-mapping-param – Joe - GMapsBook.com Mar 29 '21 at 11:25
  • If I understand correctly, then the index should be re-created. Is it possible to do something if there are already documents in the index? – fedor-sg Mar 29 '21 at 11:49
  • Yes. I think the following should work: https://stackoverflow.com/questions/66817243/convert-timestamps-to-datetime-for-use-in-elasticsearch-aggregations/66818877#66818877 You can adapt it to your text field. – Joe - GMapsBook.com Mar 29 '21 at 11:58
  • I was able to update the fields in the index. But the average turned out to be very large - 6357086.4. The average should be ~10,000 milliseconds. – fedor-sg Mar 29 '21 at 13:26
  • Can you test it with a very small subset of documents so that the avg can be manually compared? – Joe - GMapsBook.com Mar 29 '21 at 13:28
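For reference, the script-based terms aggregation suggested in the comments could look roughly like this (the field names uuid/guid are taken from the comments; this is an untested sketch that assumes both are keyword fields or keyword multifields):

```json
GET /index_one,index_two/_search
{
  "size": 0,
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc.containsKey('uuid') && doc['uuid'].size() > 0 ? doc['uuid'].value : doc['guid'].value"
        },
        "min_doc_count": 2
      }
    }
  }
}
```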