2

I'm working on a project that uses Elasticsearch to store data and show some complex statistics.

I have an index in that looks like this:

Reservation {
  id: number
  check_in: Date
  check_out: Date
  created_at: Date
  // other fields...
}

I need to calculate the average days' difference between check_in and created_at of my Reservations in a specific date range and show the result as a number.

I tried this query:

{
  "script_fields": {
    "avgDates": {
      "script": {
        "lang": "expression",
        "source": "doc['created_at'].value - doc['check_in'].value"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "created_at": {
              "gte": "{{lastMountTimestamp}}",
              "lte": "{{currentTimestamp}}"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "avgBetweenDates": {
      "avg": {
        "field": "avgDates"
      }
    }
  }
}

Dates fields are saved in ISO 8601 form (eg: 2020-03-11T14:25:15+00:00), I don't know if this could produce issues.

It catches some hits, So, the query works for sure! but, it always returns null as the value of the avgBetweenDates aggregation.

I need a result like this:

"aggregations": {
    "avgBetweenDates": {
        "value": 3.14159 // Π is just an example!
    }
}

Any ideas will help!

Thank you.

Malek Boubakri
  • 820
  • 2
  • 17
  • 36

2 Answers2

3

Scripted Fields are not stored fields in ES. You can only perform aggregation on the stored fields as scripted fields are created on the fly.

You can simply move the script logic in the Average Aggregation as shown below. Note that for the sake of understanding, I've created sample mapping, documents, query and its response.

Mapping:

PUT my_date_index
{
  "mappings": {
    "properties": {
      "check_in":{
        "type":"date",
        "format": "date_time"
      },
      "check_out":{
        "type": "date",
        "format": "date_time"
      },
      "created_at":{
        "type": "date",
        "format": "date_time"
      }
    }
  }
}

Sample Documents:

POST my_date_index/_doc/1
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-20T00:00:00.000Z",
  "created_at": "2019-01-17T00:00:00.000Z"
}

POST my_date_index/_doc/2
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-22T00:00:00.000Z",
  "created_at": "2019-01-20T00:00:00.000Z"
}

Aggregation Query:

POST my_date_index/_search
{
  "size": 0,
  "aggs": {
    "my_dates_diff": {
      "avg": {
        "script": """
          ZonedDateTime d1 = doc['created_at'].value;
          ZonedDateTime d2 = doc['check_in'].value;
          long differenceInMillis = ChronoUnit.MILLIS.between(d1, d2);
          return Math.abs(differenceInMillis/86400000);
        """
      }
    }
  }
}

Notice, that you wanted difference in number of days. The above logic does that.

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_dates_diff" : {
      "value" : 3.5              <---- Average in Number of Days
    }
  }
}

Hope this helps!

Kamal Kunjapur
  • 8,547
  • 2
  • 22
  • 32
  • This solution works for me! Thank you Kamal. But, Am I forced to remap my index just to add the `"format": "date_time"`. If yes, is there is a way to do so without deleting my old one. – Malek Boubakri Mar 12 '20 at 00:19
  • 1
    @MalekBoubakri You don't have to. Could you run the `agg` on the existing index you have and let me know if that works for you. I think it should work but if it doesn't please let me know. – Kamal Kunjapur Mar 12 '20 at 05:57
  • 1
    @Kamal I think it works without it! In fact, I think I shouldn't use it on all dates in my case! because it looks like the format `date_time` === `strict_date_time` and it force me to pass time value all the time, maybe `strict_date_optional_time` is better in some cases. Now, I am reading the [docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html#built-in-date-formats) for further info. thank you. – Malek Boubakri Mar 12 '20 at 08:23
2

Scripted fields created within the _search context can only be consumed within that scope. They're not visible within the aggregations! This means you'll have to go with either

  • moving your script to the aggs section and doing the avg there
  • a scripted metric aggregation (quite slow and difficult to get right)
  • or creating a dateDifference field at index time (preferably an int -- a difference of the timestamps) which will enable you to perform powerful numeric aggs like extended stats which provide a statistically useful output like:
{
    ...

    "aggregations": {
        "grades_stats": {
           "count": 2,
           "min": 50.0,
           "max": 100.0,
           "avg": 75.0,
           "sum": 150.0,
           "sum_of_squares": 12500.0,
           "variance": 625.0,
           "std_deviation": 25.0,
           "std_deviation_bounds": {
            "upper": 125.0,
            "lower": 25.0
           }
        }
    }
}

and are always faster than computing the timestamp differences with a script.

Joe - GMapsBook.com
  • 15,787
  • 4
  • 23
  • 68
  • Thank you jzzfs. I taked this **first option** , I moved my script to the `aggs` section as shown in your link and demonstrated in the accepted answer. Still believe that the **third option** is always the most performant and better choice but in my case, I need to query it. – Malek Boubakri Mar 12 '20 at 00:25
  • 1
    You can query it! It's a regular new field. You can query _and_ aggregate on it. – Joe - GMapsBook.com Mar 12 '20 at 08:38