0

I am new to Elasticsearch. I tried to get result from ES using CData Elasticsearch ODBC driver. Is it possible to get sum of score field?

My code:

OdbcConnection connection = new OdbcConnection("Driver={CData ODBC Driver for Elasticsearch};server=localhost");
        connection.Open();
        string query = "select sum(_score) from ordersdetails";
        OdbcCommand odbcCommand = new OdbcCommand(query, connection);
        OdbcDataReader dataReader = odbcCommand.ExecuteReader();
        DataTable dataTable = new DataTable();
        dataTable.Load(dataReader);
        connection.Close();

I have faced the below exception

System.Data.Odbc.OdbcException: 'ERROR [HY000] The '_score' column is not applicable to the sum function.'

But the below query returns result:

"select _id, sum(_score) from ordersdetails group by _id"

Anybody know, Why I got exception when tried to get a result for a single column?

If you know the solution, please share with me.

Saqib Ali
  • 3,953
  • 10
  • 55
  • 100
Kavitha M
  • 263
  • 1
  • 7
  • 23
  • Please tell me does sum query work for other numeric fields? Do you realize that `_score` is a ["virtual" field](https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html) that is computed per each pair of (query, result)? – Nikolay Vasiliev Oct 11 '17 at 08:12
  • @NikolayVasiliev : Yes. sum query works for other numeric fields. Actually my use case is to getting ElasticSearch data through ODBC driver in application like Tableau, Power BI. I have not much knowledge about Elasticsearch. So I inserted data into elasticsearch using C# code. Please let us know, there is any possibility to get sum of score value using sql query? – Kavitha M Oct 11 '17 at 10:11
  • @NikolayVasiliev: Could you please provide me a example for usage of _score field? – Kavitha M Oct 11 '17 at 10:49
  • Thanks for the reply. `_score` field is the score of the document's relevance to the query fired. For instance, if you do a full text search on a set of tweets with query like `tweetText=google` ElasticSearch will return results in order of relevance and `_score` will contain the actual value that it used for sorting the results. That's why for me the use case from the question does not make sense - there is no query, and such `sum` does not give much information. May you provide your use case for summing the `_score`? – Nikolay Vasiliev Oct 11 '17 at 11:30
  • @NikolayVasiliev: Thanks a lot. – Kavitha M Oct 11 '17 at 12:27

1 Answers1

0

After doing a couple of experiments with pyodbc and ElasticSearch I come to these conclusions:

  1. CData ODBC driver knows that it is not possible to do aggregation on _score and does not let the user to do so
  2. Behavior where it actually computes aggregation by _score is most likely a bug and is performed not by ElasticSearch but by the driver.

In short, don't use _score for any GROUP BY, it's a special feature of ElasticSearch dedicated to relevance sorting.

A bit of introduction

As I already mentioned in the comments to the question, _score in ElasticSearch is a measure of how document is relevant to a given query (see docs):

The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.

This field is not a part of the document and is computed for every query and every document. In ElasticSearch _score is used for sorting. However, _score is not always computed, for instance when sorting on an existing field is required:

The _score is not calculated, because it is not being used for sorting.

Since this field is computed on-the-fly, it is not possible to create efficient aggregation, hence ElasticSearch does not allow this directly. However, this still can be achieved by using scripts in the aggregations.

CData ODBC Driver is aware of _score field

CData ODBC Driver is aware of _score field:

When the _score column is selected, scoring will be requested by issuing a query context request, which scores the quality of the search results. By default, results are returned in descending order based on the calculated _score. An ORDER BY clause can be specified to change the order of the returned results.

When the _score column is not selected, a filter context will be sent, in which case Elasticsearch will not compute scores. The results for these queries will be returned in arbitrary order unless an ORDER BY clause is explicitly specified.

Basically, this means that by explicitly mentioning _score in your query will make ODBC return such field (which might to be there by default).

The experiments

I installed pyodbc and set up ElasticSearch 5.4 at my localhost. I tuned ES to log all the queries it receives.

1.

At first I reproduced the first case:

cursor.execute("SELECT sum(_score) FROM my_index.my_type")

And received this exception:

[HY000] The '_score' column is not applicable to the sum function.

In the log of ES I catched this query:

{"from":0,"size":100}

2.

Next I took the second query:

cursor.execute("SELECT _id, sum(_score) FROM my_index.my_type GROUP BY _id")

Which executed without exceptions, but resulted in this ES query:

{"from":0,"size":10000,"_source":{"includes":["_id","_score"],"excludes":[]}}

3.

Then I tried to mock the library with non existing fields:

cursor.execute("SELECT sum(score42) FROM simple_index.simple_type")

In this case exception was different:

[HY000] 'score42' is not a valid column.

Although the query sent to ES was the same as in the first case.

4.

Then I tried to find out how does the library send aggregate requests:

cursor.execute("SELECT sum(likes) FROM simple_index.simple_type GROUP BY likes")

In fact, it did use ES aggregations:

{
  "from": 0,
  "size": 0,
  "aggregations": {
    "likes": {
      "terms": {
        "field": "likes",
        "size": 2147483647,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "_count": "desc"
          },
          {
            "_term": "asc"
          }
        ]
      },
      "aggregations": {
        "sum_likes": {
          "sum": {
            "field": "likes"
          }
        }
      }
    }
  }
}

Conclusions

The fact that the library is able to recognize _score as a special keyword, and also because it did not attempt to generate ES aggregations when asked for sum(_score), I assume that it does not allow in general to do aggregations on _score and the "working" case here is likely a bug.

Nikolay Vasiliev
  • 5,656
  • 22
  • 31