I am stuck on an Elasticsearch query using Python.

I have data such as:

{
  "_index": "user_log",
  "_type": "logs",
  "_id": "gdUJpXIBAoADuwvHTK29",
  "_score": 1,
  "_source": {
    "user_name": "prathameshsalap@gmail.com",
    "working_hours": "2019-10-21 09:00:01"
  }
}

{
  "_index": "user_log",
  "_type": "logs",
  "_id": "gtUJpXIBAoADuwvHTK29",
  "_version": 1,
  "_score": 0,
  "_source": {
    "user_name": "vaishusawant143@gmail.com",
    "working_hours": "2019-10-21 09:15:01"
  }
}

{
  "_index": "user_log",
  "_type": "logs",
  "_id": "g9UJpXIBAoADuwvHTK29",
  "_version": 1,
  "_score": 0,
  "_source": {
    "user_name": "prathameshsalap@gmail.com",
    "working_hours": "2019-10-22 07:50:00"
  }
}

{
  "_index": "user_log",
  "_type": "logs",
  "_id": "g8UJpXIBAoADuwvHTK29",
  "_version": 1,
  "_score": 0,
  "_source": {
    "user_name": "vaishusawant143@gmail.com",
    "working_hours": "2019-10-22 04:15:01"
  }
}

Here, each user has working hours for different dates (the 21st and 22nd). I want to take the average of each user's working hours. This is my query:

{
  "size": 0,
  "query": {"match_all": {}},
  "aggs": {
    "users": {
      "terms": {
        "field": "user_name"
      },
      "aggs": {
        "avg_hours": {
          "avg": {
            "field": "working_hours"
          }
        }
      }
    }
  }
}

This query is not working. How do I find the average working hours for each user across all dates? I also want to run this query using the Python Elasticsearch client.

Update: When I use the ingest pipeline as @Val mentions below, I get an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "compile error",
        "processor_type" : "script",
        "script_stack" : [
          "\n        def workDate = /\\s+/.split(ctx.working_h ...",
          "                        ^---- HERE"
        ],
        "script" : "\n        def workDate = /\\s+/.split(ctx.working_hours);\n        def workHours = /:/.split(workDate[1]);\n        ctx.working_minutes = (Integer.parseInt(workHours[0]) * 60) + Integer.parseInt(workHours[1]);\n        ",
        "lang" : "painless",
        "position" : {
          "offset" : 24,
          "start" : 0,
          "end" : 49
        }
      }
.....

How can I solve it?

  • The problem is that your `working_hours` field is a point in time and does not denote a duration. For this use case, it's best to store the working day and working hours in two separate fields and store the working hours in minutes. – Val Jun 12 '20 at 07:10
  • How do I do that? @Val – Shivam Gupta Jun 12 '20 at 07:11

1 Answer

The problem is that your working_hours field is a point in time and does not denote a duration.

For this use case, it's best to store the working day and working hours in two separate fields and store the working hours in minutes.

So instead of having documents like this:

{
    "user_name": "prathameshsalap@gmail.com",
    "working_hours": "2019-10-21 09:00:01",
}

Create documents like this:

{
    "user_name": "prathameshsalap@gmail.com",
    "working_day": "2019-10-21",
    "working_hours": "09:00:01",
    "working_minutes": 540
}
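
If you reshape the documents in your client code before indexing, the split is straightforward with datetime. A minimal sketch (the helper name split_working_hours is my own, assuming the original "YYYY-MM-DD HH:MM:SS" format):

from datetime import datetime

def split_working_hours(doc):
    # Parse the original combined timestamp, e.g. "2019-10-21 09:00:01"
    ts = datetime.strptime(doc["working_hours"], "%Y-%m-%d %H:%M:%S")
    return {
        "user_name": doc["user_name"],
        "working_day": ts.strftime("%Y-%m-%d"),
        "working_hours": ts.strftime("%H:%M:%S"),
        # Duration since midnight in minutes: 09:00:01 -> 540
        "working_minutes": ts.hour * 60 + ts.minute,
    }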

Then you can use your query on the working_minutes field:

{
  "size": 0,
  "query": {"match_all": {}},
  "aggs": {
    "users": {
      "terms": {
        "field": "user_name.keyword",
        "order": {
          "avg_hours": "desc"
        }
      },
      "aggs": {
        "avg_hours": {
          "avg": {
            "field": "working_minutes"
          }
        }
      }
    }
  }
}
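
To run this aggregation from Python, here is a sketch with the official elasticsearch package (the connection settings are assumptions, adjust them to your cluster):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster on the default port

body = {
    "size": 0,
    "query": {"match_all": {}},
    "aggs": {
        "users": {
            "terms": {
                "field": "user_name.keyword",
                "order": {"avg_hours": "desc"}
            },
            "aggs": {
                "avg_hours": {"avg": {"field": "working_minutes"}}
            }
        }
    }
}

response = es.search(index="user_log", body=body)
for bucket in response["aggregations"]["users"]["buckets"]:
    # avg_hours.value is the average in minutes
    print(bucket["key"], bucket["avg_hours"]["value"])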

If it is not convenient to compute the working_minutes field in your client code, you can achieve the same thing using an ingest pipeline. Let's define the pipeline first:

PUT _ingest/pipeline/working-hours
{
  "processors": [
    {
      "dissect": {
        "field": "working_hours",
        "pattern": "%{?date} %{tmp_hours}:%{tmp_minutes}:%{?seconds}"
      }
    },
    {
      "convert": {
        "field": "tmp_hours",
        "type": "integer"
      }
    },
    {
      "convert": {
        "field": "tmp_minutes",
        "type": "integer"
      }
    },
    {
      "script": {
        "source": """
        ctx.working_minutes = (ctx.tmp_hours * 60) + ctx.tmp_minutes;
        """
      }
    },
    {
      "remove": {
        "field": [
          "tmp_hours",
          "tmp_minutes"
        ]
      }
    }
  ]
}
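
The same pipeline can also be created from Python; a sketch using the client's ingest API (the body mirrors the PUT request above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Equivalent of PUT _ingest/pipeline/working-hours above
es.ingest.put_pipeline(
    id="working-hours",
    body={
        "processors": [
            {"dissect": {
                "field": "working_hours",
                "pattern": "%{?date} %{tmp_hours}:%{tmp_minutes}:%{?seconds}"
            }},
            {"convert": {"field": "tmp_hours", "type": "integer"}},
            {"convert": {"field": "tmp_minutes", "type": "integer"}},
            {"script": {
                "source": "ctx.working_minutes = (ctx.tmp_hours * 60) + ctx.tmp_minutes;"
            }},
            {"remove": {"field": ["tmp_hours", "tmp_minutes"]}}
        ]
    }
)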

Then you need to update your Python client code to use the new pipeline, which will create the working_minutes field for you:

helpers.bulk(es, reader, index='user_log', doc_type='logs', pipeline='working-hours') 
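
Tying this to the CSV flow from the comments, the full indexing call might look like this (the file path user_log.csv is a placeholder):

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Each CSV row becomes one document; the pipeline adds working_minutes at index time
with open("user_log.csv") as files:
    reader = csv.DictReader(files)
    helpers.bulk(es, reader, index="user_log", doc_type="logs",
                 pipeline="working-hours")
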
  • So, how do I extract the date and hours? And how do I convert hours to minutes using an Elasticsearch query? – Shivam Gupta Jun 12 '20 at 07:19
  • Since you want an average of working hours, what date do you want? Converting hours to minutes is easy, you can do it in your client code: `working_minutes = hour * 60 + minutes` – Val Jun 12 '20 at 07:26
  • I don't want any particular date. I want the working hours of each user across all dates. For example, user 'a' worked on the 21st (7h:30min) and the 22nd (3h:00min), so user 'a''s average working hours are 5h:15min. I hope you get what I want. Thank you. – Shivam Gupta Jun 12 '20 at 07:33
  • So you just need the average working hours of each employee and that's what the above query gives you. If you ever need a date range constraint you can add it to the query part. – Val Jun 12 '20 at 07:35
  • Yeah, the above query is fine, but the only problem is extracting the time field from working_hours and converting it into minutes using Elasticsearch. Is there any function for extracting the time and converting it into minutes in an Elasticsearch DSL query? – Shivam Gupta Jun 12 '20 at 07:40
  • How do you index your data? – Val Jun 12 '20 at 07:41
  • I have a CSV file, indexed using bulk: `with open(path) as files: reader = csv.DictReader(files); helpers.bulk(es, reader, index='user_log', doc_type='logs')` – Shivam Gupta Jun 12 '20 at 07:45
  • It gives an error for the pipeline: `def workDate = /\\s+/.split(ctx.working_h ...", " ^---- HERE"` – Shivam Gupta Jun 12 '20 at 08:15
  • Are you sure that all your documents have a non-null `working_hours` field? – Val Jun 12 '20 at 08:16
  • When I run the ingest pipeline it gives an error, and my working_hours does not have null values. – Shivam Gupta Jun 12 '20 at 08:34
  • Can you update your question with the full error please? – Val Jun 12 '20 at 08:35
  • Let's do it differently without regular expressions. See my updated answer – Val Jun 12 '20 at 09:20
  • Thank you, it's working. How do I use the ingest pipeline with the Python Elasticsearch client? – Shivam Gupta Jun 12 '20 at 11:26
  • When I run the query it gives an error: "type" : "illegal_argument_exception", "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [user_name] in order to load field data by uninverting the inverted index. Note that this can use significant memory." – Shivam Gupta Jun 12 '20 at 11:48
  • You should use the `user_name.keyword` field instead of `user_name` in the terms aggregation. – Val Jun 12 '20 at 11:50
  • If I want to sort user_name with respect to average working hours, where do I add the sort in the query? @Val – Shivam Gupta Jun 12 '20 at 14:42
  • One more question: how do I select only specific fields instead of match_all? I have id, user_name, working_hours, phone_no, Address, and working_minutes, and I only want user_name and working_minutes. Thanks for your help. @Val – Shivam Gupta Jun 12 '20 at 15:54
  • I think it's best if you create a new question, this one is solved. – Val Jun 12 '20 at 16:04
  • I have created a new question related to this: https://stackoverflow.com/questions/62349069/elasticsearch-select-multiple-fields-in-query-not-match-all @val – Shivam Gupta Jun 12 '20 at 17:05