
I need to calculate a weighted average value using Elasticsearch, and I can't change the structure of the documents. Assume there are two indexed documents.
The first document:

const doc1 = {
  "id": "1",
  "userId": "2",
  "scores" : [
    {
      "name": "score1",
      "value": 93.0
    },
    {
      "name": "score2",
      "value": 90.0
    },
    {
      "name": "score3",
      "value": 76.0
    }
  ],
  "metadata": {
    "weight": 130
  }
}

The second document:

const doc2 = {
  "id": "2",
  "userId": "2",
  "scores" : [
    {
      "name": "score1",
      "value": 80.0
    },
    {
      "name": "score2",
      "value": 70.0
    },
    {
      "name": "score3",
      "value": 88.0
    }
  ],
  "metadata": {
    "weight": 50
  }
}

Calculations should be done using the following formulas:

score1Avg = (doc1.scores['score1'].value * doc1.metadata.weight +
       doc2.scores['score1'].value * doc2.metadata.weight) /
      (doc1.metadata.weight + doc2.metadata.weight)

score2Avg = (doc1.scores['score2'].value * doc1.metadata.weight +
       doc2.scores['score2'].value * doc2.metadata.weight) /
      (doc1.metadata.weight + doc2.metadata.weight)

score3Avg = (doc1.scores['score3'].value * doc1.metadata.weight +
       doc2.scores['score3'].value * doc2.metadata.weight) /
      (doc1.metadata.weight + doc2.metadata.weight)
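For the two documents above, this works out to, for example:

score1Avg = (93.0 * 130 + 80.0 * 50) / (130 + 50) = 16090 / 180 ≈ 89.39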

I tried using the nested type for mapping scores, but then I can't access the parent document field metadata.weight. How should this be approached: do I need a nested type mapping, or can this be done in some other way without it?

Edit: I ended up storing each scores element as a separate document. Instead of doc1, I now have the following documents.

{
  "id": "1",
  "userId": "2",
  "score": {
      "name": "score1",
      "value": 93.0
    },
  "metadata": {
    "weight": 130
  }
}
{
  "id": "1",
  "userId": "2",
  "score": {
      "name": "score2",
      "value": 90.0
    },
  "metadata": {
    "weight": 130
  }
}
{
  "id": "1",
  "userId": "2",
  "score": {
      "name": "score3",
      "value": 76.0
    },
  "metadata": {
    "weight": 130
  }
}
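
A minimal sketch of that split on the client side (plain JavaScript; splitDoc is a hypothetical helper, and assigning distinct _id values at indexing time is left out):

const splitDoc = (doc) =>
  doc.scores.map((score) => ({
    id: doc.id,
    userId: doc.userId,
    score, // { name, value }
    metadata: { weight: doc.metadata.weight },
  }));

// splitDoc(doc1) yields the three documents shown above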

And the query is:

GET /scores/_search
{
  "size": 0,
  "aggs": {
    "group_by_score_and_user": {
      "composite": {
        "sources": [
          {
            "scoreName": {
              "terms": { "field": "score.name.keyword" }
            }
          },
          {
            "userId": {
              "terms": { "field": "userId.keyword" }
            }
          }
        ]
      },
      "aggs": {
        "avg": {
          "weighted_avg": {
            "value": { "field": "score.value" },
            "weight": { "field": "metadata.weight" }
          }
        }
      }
    }
  }
}
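
One caveat with this query: a composite aggregation returns at most `size` buckets per request (10 by default), so with many distinct score names and users the response's `after_key` has to be passed back as `after` to page through the remaining buckets. A follow-up request might look like this (the `after` values here are just illustrative):

GET /scores/_search
{
  "size": 0,
  "aggs": {
    "group_by_score_and_user": {
      "composite": {
        "size": 100,
        "after": { "scoreName": "score3", "userId": "2" },
        "sources": [
          { "scoreName": { "terms": { "field": "score.name.keyword" } } },
          { "userId": { "terms": { "field": "userId.keyword" } } }
        ]
      },
      "aggs": {
        "avg": {
          "weighted_avg": {
            "value": { "field": "score.value" },
            "weight": { "field": "metadata.weight" }
          }
        }
      }
    }
  }
}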

Btw, the query with the script approach takes 120 ms on average against 5k documents, compared to this one, which takes about 35-40 ms over 100k documents.

1 Answer


Edited to fit the requirement in the comment. Like I said before, this is not an optimal solution at all; the use of scripts + params._source + my subpar Java will make this very slow or unusable with a lot of docs.

Still, I learned a lot.

Mapping:

{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "userId": {
        "type": "keyword"
      },
      "scores": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "value": {
            "type": "float"
          }
        }
      },
      "metadata": {
        "properties": {
          "weight": {
            "type": "float"
          }
        }
      }
    }
  }
}

Docs:

POST ron_test/_doc/1
{
  "id": "1",
  "userId": "2",
  "scores" : [
    {
      "name": "score1",
      "value": 93.0
    },
    {
      "name": "score2",
      "value": 90.0
    },
    {
      "name": "score3",
      "value": 76.0
    }
  ],
  "metadata": {
    "weight": 130
  }
}

POST ron_test/_doc/2
{
  "id": "2",
  "userId": "2",
  "scores" : [
    {
      "name": "score1",
      "value": 80.0
    },
    {
      "name": "score2",
      "value": 70.0
    },
    {
      "name": "score3",
      "value": 88.0
    }
  ],
  "metadata": {
    "weight": 50
  }
}

POST ron_test/_doc/3
{
  "id": "2",
  "userId": "2",
  "scores" : [
    {
      "name": "score1",
      "value": 80.0
    },
    {
      "name": "score2",
      "value": 70.0
    },
    {
      "name": "score9",
      "value": 88.0
    }
  ],
  "metadata": {
    "weight": 12
  }
}

POST ron_test/_doc/4
{
  "id": "2",
  "userId": "2",
  "scores" : [
    {
      "name": "score9",
      "value": 50.0
    }
  ],
  "metadata": {
    "weight": 17
  }
}

Query:

GET ron_test/_search
{
  "size": 0,
  "aggs": {
    "weighted_avg": {
      "scripted_metric": {
        "init_script": """
        // per-shard accumulators: score name -> weighted sum, score name -> total weight
        state.name_to_sum = new HashMap();
        state.name_to_weight = new HashMap();
        """,
        "map_script": """
        // iterate scores via _source: doc values would flatten the name/value pairing
        for (score in params._source['scores']){
          def name = score['name'];
          def value = score['value'];
          def weight = doc['metadata.weight'].value;
          
          if (state.name_to_sum.containsKey(name)){
            state.name_to_sum[name] += value * weight;
          }
          else {
            state.name_to_sum[name] = value * weight;
          }
          
          if (state.name_to_weight.containsKey(name)){
            state.name_to_weight[name] += weight;
          }
          else {
            state.name_to_weight[name] = weight;
          }
          
        }
        """,
        "combine_script": "return [state.name_to_sum, state.name_to_weight]",
        "reduce_script": """
        // merge the per-shard maps, summing entries that share the same score name
        def total_score_per_name = new HashMap();
        def total_weight_per_name = new HashMap();
        
        for (state in states){
          total_score_per_name = Stream.concat(total_score_per_name.entrySet().stream(), state[0].entrySet().stream())
             .collect(Collectors.groupingBy(Map.Entry::getKey,
             Collectors.summingDouble(Map.Entry::getValue)));
             
          total_weight_per_name = Stream.concat(total_weight_per_name.entrySet().stream(), state[1].entrySet().stream())
             .collect(Collectors.groupingBy(Map.Entry::getKey,
             Collectors.summingDouble(Map.Entry::getValue)));
        }
        
        def results = new HashMap();
        total_score_per_name.forEach((name, score) -> results[name] = score / total_weight_per_name[name]);
        return results;
        """
      }
    }
  }
}

Results:

{
  "took" : 258,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "weighted_avg" : {
      "value" : {
        "score9" : 65.72413793103448,
        "score2" : 83.54166666666667,
        "score3" : 79.33333333333333,
        "score1" : 88.80208333333333
      }
    }
  }
}
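
As a sanity check against the formula: score3 appears only in docs 1 and 2, so score3Avg = (76.0 * 130 + 88.0 * 50) / (130 + 50) = 14280 / 180 ≈ 79.33, which matches the aggregation output above.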

More info on scripted metrics: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html

Btw, the way I would choose to simplify this is to insert the metadata.weight value inside every nested score, along the lines sketched below.
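
A rough, untested sketch of that variant (`ron_test2` is a hypothetical index name): once every nested score carries its own weight, a plain nested + terms + weighted_avg chain does the job without any scripting.

PUT ron_test2
{
  "mappings": {
    "properties": {
      "userId": { "type": "keyword" },
      "scores": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" },
          "value": { "type": "float" },
          "weight": { "type": "float" }
        }
      }
    }
  }
}

GET ron_test2/_search
{
  "size": 0,
  "aggs": {
    "scores": {
      "nested": { "path": "scores" },
      "aggs": {
        "by_name": {
          "terms": { "field": "scores.name" },
          "aggs": {
            "avg": {
              "weighted_avg": {
                "value": { "field": "scores.value" },
                "weight": { "field": "scores.weight" }
              }
            }
          }
        }
      }
    }
  }
}

The nested mapping is what keeps each value paired with its weight; with a plain object mapping the two arrays would be flattened independently and the pairing lost.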

Ron Serruya
  • I'm afraid this is not a satisfactory solution. score1, 2 and 3 are just examples; there can be an arbitrary number of scores, all with different names but within the same kind of documents. So it would be great if there were some way to group into buckets by `score.name` value and then do the calculations. Is there some way to provide a weight to nested documents? I can group them successfully but can't do the calculations because I'm missing the reference to `metadata.weight`. Thanks. – Nemanja Stankovic Sep 30 '21 at 17:36
  • @NemanjaStankovic Edited my answer – Ron Serruya Sep 30 '21 at 19:39
  • Exactly my thoughts on the last thing you said, putting a weight inside every score, as there will not be more than 10 scores. But there will probably be around 48 million documents created each month, so the script approach would fail sooner or later. Thank you for your time and effort. – Nemanja Stankovic Oct 04 '21 at 09:06