
I need a way to create a transform that aggregates the average of a field while excluding outliers (say, only the values that fall between the 10th and 90th percentiles). For example, if I have the following documents:

[
  { "someField": 1 },
  { "someField": 2 },
  { "someField": 3 },
  { "someField": 4 },
  { "someField": 5 },
  { "someField": 6 },
  { "someField": 7 },
  { "someField": 8 },
  { "someField": 9 },
  { "someField": 10 }
]

It would calculate the average of the values 2 through 9, i.e. (2 + 3 + … + 9) / 8 = 5.5.

Edited: renamed "value" to "someField"

  • I presume the field name `value` is just a placeholder; although it's not a reserved ES keyword, I'd refrain from using it because it's confusing, esp. when accessing field values in scripts -- as seen in my answer below. – Joe - GMapsBook.com Dec 31 '20 at 12:53
  • good point, edited! – Ofir Dec 31 '20 at 16:21

2 Answers


You can do this in one go with a scripted_metric aggregation, but you'd have to write the percentiles function and then the avg function yourself -- I wrote one here. The script isn't going to be performant, though, so I don't think it'd be worth the effort…
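For illustration only, here's a rough sketch of what such a scripted_metric could look like (this is not the linked implementation -- it hardcodes the 10%/90% trim, assumes the field is called value as in the queries below, and collects every value into memory in the reduce phase, which is exactly why it won't scale):

POST myindex/_search
{
  "size": 0,
  "aggs": {
    "trimmed_avg": {
      "scripted_metric": {
        "init_script": "state.vals = []",
        "map_script": "state.vals.add(doc['value'].value)",
        "combine_script": "return state.vals",
        "reduce_script": """
          // merge the per-shard value lists and sort them
          def all = [];
          for (def shard : states) { all.addAll(shard); }
          Collections.sort(all);
          // keep only the slice between the 10th and 90th percentile positions
          int from = (int) Math.ceil(all.size() * 0.1);
          int to = (int) Math.floor(all.size() * 0.9);
          double sum = 0; int count = 0;
          for (int i = from; i < to; i++) { sum += all.get(i); count++; }
          return count > 0 ? sum / count : null;
        """
      }
    }
  }
}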

I'd instead suggest first retrieving the percentile bounds:

POST myindex/_search
{
  "size": 0,
  "aggs": {
    "boundaries": {
      "percentiles": {
        "field": "value",
        "percents": [
          10,
          90
        ]
      }
    }
  }
}

which yields [1.5, 9.5], and then plugging these numbers into a weighted_avg aggregation:

POST myindex/_search
{
  "size": 0,
  "aggs": {
    "avg_without_outliers": {
      "weighted_avg": {
        "value": {
          "field": "value"
        },
        "weight": {
          "script": {
            "source": "def v = doc.value.value; return v <= params.min || v >= params.max ? 0 : 1",
            "params": {
              "min": 1.5,
              "max": 9.5
            }
          }
        }
      }
    }
  }
}

The weight is either 0 or 1, depending on whether the particular doc that's being traversed is an outlier or not.

  • Thanks Joe. So I would have to use 2 requests -- is there any way to create a transform that does that? Also, what is the purpose of the weighted avg? Couldn't I just filter all the values between 1.5 and 9.5 with a range query and use a regular avg aggregation? That seems faster. – Ofir Dec 31 '20 at 16:47
  • Good point -- you surely could use a range query. Again, I don't think there's a transform that doesn't use a scripted metric agg. – Joe - GMapsBook.com Jan 01 '21 at 12:50
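
For completeness, a minimal sketch of the range-query-plus-plain-avg alternative raised in the comments, reusing the [1.5, 9.5] bounds and the value field from the answer above:

POST myindex/_search
{
  "size": 0,
  "query": {
    "range": {
      "value": {
        "gt": 1.5,
        "lt": 9.5
      }
    }
  },
  "aggs": {
    "avg_without_outliers": {
      "avg": {
        "field": "value"
      }
    }
  }
}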

Simple Implementation

  1. Query the index to find the distribution of the field and determine the lower and upper outlier limits.
  2. Run your query again with a range filter on those limits to exclude the outlier documents, then average what remains (see the avg variant after Query 2 below).

Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html

Query 1

GET your-index-documents/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "your-number-feild",
        "percents": [
          5,
          95
        ]
      }
    }
  }
}

Query 2

GET your-index-documents/_search
{
  "query": {
    "bool": {
      "must": [
        // ...all your other query clauses
        {
          "range": {
            "your-number-field": {
              "gte": 41,
              "lte": 39478
            }
          }
        }
      ]
    }
  }
}
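
To actually get the average without the outliers, add an avg aggregation on top of that filtered query -- for example (a sketch using the same placeholder field name and example bounds as above):

GET your-index-documents/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "your-number-field": {
              "gte": 41,
              "lte": 39478
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "avg_without_outliers": {
      "avg": {
        "field": "your-number-field"
      }
    }
  }
}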