0

I am new to Elastic Search and was exploring aggregation query. The documents I have are in the format -

{"name":"A",
     "class":"10th",
     "subjects":{
         "S1":92,
         "S2":92,
         "S3":92,
     }
}

We have about 40k such documents in our ES with the Subjects varying from student to student. The query to the system can be to aggregate all subject-wise scores for a given class. We tried to create a bucket aggregation query as explained in this guide here, however, this generates a single bucket per document and in our understanding requires an explicit mention of every subject.

We want to system to generate subject wise aggregate for the data by executing a single aggregation query, the problem I face is that in our data the subjects could vary from student to student and we don't have a global list of subject keys.

We wrote the following script but this only works if we know all possible subjects.

GET student_data_v1_1/_search

{ "query" :
    {"match" : 
         { "class" : "' + query + '" }}, 
         "aggs" : { "my_buckets" : { "terms" : 
         { "field" : "subjects", "size":10000 },
         "aggregations": {"the_avg": 
                      {"avg": { "field": "subjects.value" }}} }},
          "size" : 0 }'

but this query only works for the document structure, but does not work multiple subjects are defined where we may not know the key-pair -

{"name":"A",
     "class":"10th",
     "subjects":{
         "value":93
     }
}

An alternate form the document is present is that the subject is a list of dictionaries -

    {"name":"A",
     "class":"10th",
     "subjects":[
         {"S1":92},
         {"S2":92},
         {"S3":92},
     ]
}

Having an aggregation query to solve either of the 2 document formats would be helpful.
======EDITS======

After updating the document to hold weights for each subject -

        {
class": "10th",
   "subject": [
 {
   "name": "s1",
   "marks": 90,
   "weight":30
 },
 {
   "name": "s2",
  "marks": 80,
   "weight":70
 }
   ]}

I have updated the query to be -

{
  "query": {
    "match": {
      "class": "10th"
}
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "scores"
      },    
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
      },
      "aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } }
      }
    }
  }
}
      },
      "size": 0
    }

but it throws the error-

{u'error': {u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'root_cause': [{u'col': 312,
u'line': 1,
u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]',
u'type': u'unknown_named_object_exception'}],
u'type': u'unknown_named_object_exception'},
u'status': 400}
Vadim Kotov
  • 8,084
  • 8
  • 48
  • 62
User54211
  • 121
  • 2
  • 11
  • Can you share the mapping of index in both the above cases. You are trying to apply aggregation on `subject.value` whereas `value` property doesn't exist under `subject` – Nishant Nov 26 '18 at 16:46
  • @NishantSaini I have updated the question with the sample document with "value" property. It only works if we know all the possible key pairs or/and select only 1 key/document. – User54211 Nov 26 '18 at 18:24

1 Answers1

1

To achieve the required result I would suggest you to keep your index mapping as follows:

{
  "properties": {
    "class": {
      "type": "keyword"
    },
    "subject": {
      "type": "nested",
      "properties": {
        "marks": {
          "type": "integer"
        },
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}

In the mapping above I have created subject as nested type with two properties, name to hold subject name and marks to hold marks in the subject.

Sample doc:

{
  "class": "10th",
  "subject": [
    {
      "name": "s1",
      "marks": 90
    },
    {
      "name": "s2",
      "marks": 80
    }
  ]
}

Now you can use nested aggregation and multilevel aggregation (i.e. aggregation inside aggregation). I used nested aggregation with terms aggregation for subject.name to get bucket containing all the available subjects. Then to get avg for each subject we add a child aggregation of avg to the subjects aggregation as below:

{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subject"
      },
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
          },
          "aggs": {
            "avg_score": {
              "avg": {
                "field": "subject.marks"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

NOTE: I have added "size" : 0 so that elastic doesn't return matching docs in the result. To include or exclude it depends totally on your use case.

Sample result:

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": [

    ]
  },
  "aggregations": {
    "subjects": {
      "doc_count": 6,
      "subjects": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "s1",
            "doc_count": 3,
            "avg_score": {
              "value": 80
            }
          },
          {
            "key": "s2",
            "doc_count": 2,
            "avg_score": {
              "value": 75
            }
          },
          {
            "key": "s3",
            "doc_count": 1,
            "avg_score": {
              "value": 80
            }
          }
        ]
      }
    }
  }
}

As you can see the result contains buckets with key as subject name and avg_score.value as the avg of marks.

UPDATE to include weighted_avg:

{
  "query": {
    "match": {
      "class": "10th"
    }
  },
  "aggs": {
    "subjects": {
      "nested": {
        "path": "subject"
      },
      "aggs": {
        "subjects": {
          "terms": {
            "field": "subject.name"
          },
          "aggs": {
            "avg_score": {
              "avg": {
                "field": "subject.marks"
              }
            },
            "weighted_grade": {
              "weighted_avg": {
                "value": {
                  "field": "subject.marks"
                },
                "weight": {
                  "field": "subject.weight"
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
Nishant
  • 7,504
  • 1
  • 21
  • 34
  • thanks @nishantsaini this worked exactly as needed. I have added a "weight" key while ingesting the documents - { "class": "10th", "subject": [ { "name": "s1", "marks": 90, "weight":30 }, { "name": "s2", "marks": 80, "weight":70 } ] } but on updating the query to weighted average the score is throwing an error. I have updated the nested "aggs" to - "aggs" : { "weighted_grade": { "weighted_avg": { "value": { "field": "subjects.score" }, "weight": { "field": "subjects.weight" } } } } – User54211 Nov 27 '18 at 06:34
  • I have updated the question to clearly show what I mean in the above comment – User54211 Nov 27 '18 at 06:46
  • your update doesn't cover the mention of the weight that I have added in the edit update. I want to aggregate the marks based on weights assigned subject-wise. – User54211 Nov 27 '18 at 08:21
  • Your query has malformed json. Otherwise it seemed to be correct. – Nishant Nov 27 '18 at 08:45
  • Updated the answer, please check. You can remove `avg_score` block completely from the query if you don't require it. – Nishant Nov 27 '18 at 09:50
  • I continue to get this error - >{u'error': {u'col': 265, u'line': 1, u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]', u'root_cause': [{u'col': 265, u'line': 1, u'reason': u'Unknown BaseAggregationBuilder [weighted_avg]', u'type': u'unknown_named_object_exception'}], u'type': u'unknown_named_object_exception'}, u'status': 400} – User54211 Nov 27 '18 at 10:56
  • 1
    Can you copy paste your complete query as a seperate question. I or someone else can then look at it to find the issue. – Nishant Nov 27 '18 at 11:04
  • done here [link] (https://stackoverflow.com/questions/53498595/weighted-average-for-nested-aggregation-in-elastic-search) – User54211 Nov 27 '18 at 11:25