0

I have elasticsearch documents like below where I need to rectify age value based on creationtime currentdate

age = creationtime - currentdate

:

hits = [
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2018-05-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2013-07-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2014-08-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2015-09-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   }
]

I want to do bulk update based on each document ID, but the problem is I need to correct 6 months of data & per data size (doc count of Index) is almost 535329, I want to efficiently do bulk update on age based on _id for each day on all documents using python.

Is there a way to do this, without looping through, all examples I came across using Pandas dataframes for update is based on a known value. But here _id I will get as and when the code runs.

The logic I had written was to fetch all doc & store their _id & then for each _id update the age . But its not an efficient way if I want to update all documents in bulk for each day of 6 months.

Can anyone give me some ideas for this or point me in the right direction.

narendra-choudhary
  • 4,582
  • 4
  • 38
  • 58
  • What exactly do you need the `_id` for? Do you do another lookup with that id to update the age? Or is the age rather based on the difference of the two timestamps? – Joe - GMapsBook.com Feb 19 '21 at 14:14
  • @JoeSorocin _id I need so that by mistake I do not update some other document. Just to keep track of which doc I am updating or in case of failure, which document was last updated – Jermy Fields Feb 19 '21 at 14:14
  • That won't be a problem because the updates will be atomic. Can you explain a little more the logic behind the `age` calculation? Is is just the timestamp difference or does anything else come into play? – Joe - GMapsBook.com Feb 19 '21 at 14:18
  • @JoeSorocin age is just a difference , so creationtime is what we are reading from the original data from the server & currentdate we are inserting as the the time when the document was inserted – Jermy Fields Feb 19 '21 at 14:31

1 Answers1

0

As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!

A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:

POST your-index-name/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      def created =  LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));

      def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
    
      def months = ChronoUnit.MONTHS.between(created, currentdate);
      ctx._source._age = months + ' month' + (months > 1 ? 's' : '');
    """,
    "lang": "painless"
  }
}

The official python client has this method too. Here's a working example.

Try running this update script on a small subset of your documents before letting in out on your whole index by adding a query other than the match_all I put there.


It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index because it can be calculated at query time.

You see, if your index mapping's dates are properly defined like so:

{
  "mappings": {
    "properties": {
      "creationtime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      },
      "currentdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      ...
    }
  }
}

the age can be calculated as a script field:

POST ttimes/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "age_calculated": {
      "script": {
        "source": """
          def months = ChronoUnit.MONTHS.between(
                          doc['creationtime'].value,
                          doc['currentdate'].value );
          return months + ' month' + (months > 1 ? 's' : '');
        """
      }
    }
  }
}

The only caveat is, the value won't be inside of the _source but rather inside of its own group called fields (which implies that more script fields are possible at once!).

"hits" : [
  {
    ...
    "_id" : "FFfPuncBly0XYOUcdIs5",
    "fields" : {
      "age_calculated" : [ "32 months" ]   <--
    }
  },
  ...
Joe - GMapsBook.com
  • 15,787
  • 4
  • 23
  • 68
  • I am facing a weird issue. It seems to be with my mapping. This is my mapping in ES : ```{ "currentdate":{ "type":"date", "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM-dd-HH:mm" }, "creationdatetime":{ "type":"date", "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM-dd-HH:mm" } }``` . I get this error while inserting to ES : ```u'failed to parse date field [2019-12-24T15:34:52] with format [yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM-dd-HH:mm]```. – Jermy Fields Feb 20 '21 at 13:25
  • This is how I am formatting it : ```df_final['creationdatetime'] = pd.to_datetime((df_final['creationtime'].astype(str).str[:10]).astype(int), unit='s') df_final['creationdatetime'] = pd.to_datetime(df_final['creationdatetime'], format = 'yyyy-MM-dd HH:mm:ss')``` – Jermy Fields Feb 20 '21 at 13:25
  • However when I print the final dataframe in a .csv file, the output is correct : ```currentdate, creationtime, creationdatetime 2021-02-02 00:00:00, 1587132656820, 2020-04-17 14:10:56 2021-02-02 00:00:00, 1587243135874, 2020-04-18 20:52:15 ``` – Jermy Fields Feb 20 '21 at 13:27
  • You're missing the `'T'` separator in your format from the first comment. Try this: `yyyy-MM-dd HH:mm:ss||yyyy-MM-dd'T'HH:mm:ss||yyyy-MM-dd||yyyy-MM-dd-HH:mm`. – Joe - GMapsBook.com Feb 20 '21 at 13:35
  • My doubt is why I need to specify ***"T"*** in format of ***creationdatetime*** – Jermy Fields Feb 20 '21 at 13:35
  • The docs from your example seem to include it. – Joe - GMapsBook.com Feb 20 '21 at 13:39
  • is there a way to avoid it, I see in your mapping also you have included "T" for creationtime but not for currentdate. In that case will I be able to be do normal date subtraction ? – Jermy Fields Feb 20 '21 at 13:50
  • I didn't see the `'T'` in `currentdate` so I left it out. TBH it confused me too. – Joe - GMapsBook.com Feb 20 '21 at 13:57
  • I changed my mapping with "T" in format , data got inserted successfully, when I try to update using above method, I get below error ```"cause": { "type": "version_conflict_engine_exception", "reason": "[vuDBv3cB_uqfwo-WAWct]: version conflict, required seqNo [45001], primary term [1]. current document has seqNo [220001] and primary term [1]", "index_uuid": "jl6-JnfkSAOMAPie9wUXnA", "shard": "0", "index": "dummy" }, "status": 409``` – Jermy Fields Feb 20 '21 at 14:15
  • ```{ "query": {"bool": {"should": [{"range": {"currentdate": {"gte": "2021-02-01 00:00:00","lt": "2021-02-03 00:00:00"}}}]}}, "script": { "source": """ def created = LocalDateTime.parse(ctx._source.creationdatetime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")); def currentdate = LocalDateTime.parse(ctx._source.recorddate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")); def months = ChronoUnit.MONTHS.between(created, currentdate); ctx._source.age = months + ' month' + (months > 1 ? 's' : ''); """, "lang": "painless"} }``` – Jermy Fields Feb 20 '21 at 14:19
  • Retry your request with a `?retry_on_conflict=3` parameter. More info [here](https://stackoverflow.com/a/64040923/8160318). – Joe - GMapsBook.com Feb 20 '21 at 14:19
  • I am using ES 7.2, still I get ```"reason": "request [/dummy*/_update_by_query] contains unrecognized parameter: [retry_on_conflict]"```. I can see in official doc, ***retry_on_conflict*** is there – Jermy Fields Feb 20 '21 at 14:36
  • If I use ```POST /_update_by_query?conflicts=proceed```. I get ```"error": "Gateway Time-out",``` – Jermy Fields Feb 20 '21 at 14:42
  • Sorry, use `retry_on_conflict` instead. If it times out, I think it's "fine" because it'll start a background process which you can monitor with `GET /_tasks` ([docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html)). – Joe - GMapsBook.com Feb 20 '21 at 15:07