
I'm trying to achieve MySQL-like behavior by adding inserted_at/updated_at metadata to each doc I index through an ES ingest pipeline.

My pipeline is like:

{
  "description": "Adds createdAt and updatedAt style timestamps",
  "processors": [
    {
      "set": {
        "field": "_source.indexed_at",
        "value": "{{_ingest.timestamp}}",
        "override": false
      }
    },
    {
      "set": {
        "field": "_source.updated_at",
        "value": "{{_ingest.timestamp}}",
        "override": true
      }
    }
  ]
}
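For reference, the pipeline above is registered under the name used in the indexing calls below (timestamps):

PUT _ingest/pipeline/timestamps
{
  "description": "Adds createdAt and updatedAt style timestamps",
  "processors": [
    {
      "set": {
        "field": "_source.indexed_at",
        "value": "{{_ingest.timestamp}}",
        "override": false
      }
    },
    {
      "set": {
        "field": "_source.updated_at",
        "value": "{{_ingest.timestamp}}",
        "override": true
      }
    }
  ]
}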

I have no mapping; I only tried it by adding one doc:

POST test_pipelines/doc/1?pipeline=timestamps
{
  "foo": "bar"
}

The pipeline successfully creates indexed_at and updated_at:

{
  "_index": "test_pipelines",
  "_type": "doc",
  "_id": "1",
  "_score": 1,
  "_source": {
    "indexed_at": "2018-07-12T10:47:27.957Z",
    "updated_at": "2018-07-12T10:47:27.957Z",
    "foo": "bar"
  }
}

But if I update doc 1, the indexed_at field changes every time to the date the document is updated.

Update request example:

POST test_pipelines/doc/1?pipeline=timestamps
{
  "foo": "bor"
}

Is there any way to tell the processor not to update the indexed_at field?

jordivador

1 Answer


The reason this is happening is that the set processor only operates within the context of the document you're sending, not the one already stored (if any). Hence, override has no effect here: the document you send contains neither indexed_at nor updated_at, which is why both fields are set on each call.
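You can verify this with the simulate API: when the incoming document already carries an indexed_at value, override: false leaves it alone; when it doesn't, the processor sets it. For instance (a quick sketch against your timestamps pipeline):

POST _ingest/pipeline/timestamps/_simulate
{
  "docs": [
    { "_source": { "foo": "bar" } },
    { "_source": { "foo": "bar", "indexed_at": "2018-01-01T00:00:00.000Z" } }
  ]
}

The first simulated doc comes back with a fresh indexed_at, while the second keeps its 2018-01-01 value, since the field was present in the incoming document.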

When you PUT your document a second time, you're not updating it, you're actually re-indexing it from scratch (i.e. you're overriding the first version you sent). Ingest pipelines do not work with update operations. For instance, if you try the following update call, it will fail.

POST test_pipelines/doc/1/_update?pipeline=timestamps
{
  "doc": {
    "foo": "bor"
  }
}

If you want to stick with your ingest pipeline, the only way to make it work is to GET the document first and then update the field(s) you want. For instance,

# 1. index the document the first time
PUT test_pipelines/doc/1?pipeline=timestamps
{
  "foo": "bar"
}

# 2. GET the indexed document
GET test_pipelines/doc/1

# 3. update the foo field and index it again
PUT test_pipelines/doc/1?pipeline=timestamps
{
  "indexed_at": "2018-07-20T05:08:52.293Z",
  "updated_at": "2018-07-20T05:08:52.293Z",
  "foo": "bor"
}

# 4. When you GET the document the second time, you'll see your pipeline worked
GET test_pipelines/doc/1

This will return:

{
  "indexed_at": "2018-07-20T05:08:52.293Z",
  "updated_at": "2018-07-20T05:08:53.345Z",
  "foo": "bor"
}

I definitely agree this is really troublesome, but there are good reasons why pipelines are not supported on update operations, as the failing _update call above shows.

Another way to make it work the way you like (without pipelines) would be to use a scripted upsert operation (which works like steps 2 and 3 above, i.e. GETs and PUTs the document in a single atomic operation), and that would also work with your bulk calls. It basically goes like this. First you need to store a script that you will call for both your indexing and update operations:

POST _scripts/update-doc
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.foo = params.foo; ctx._source.updated_at = new Date(); if (ctx._source.indexed_at == null) ctx._source.indexed_at = ctx._source.updated_at;"
  }
}

Then, you can index your document the first time like this:

POST test_pipelines/doc/1/_update
{
  "script": {
    "id": "update-doc",
    "params": {
      "foo": "bar"
    }
  },
  "scripted_upsert": true,
  "upsert": {}
}

The indexed document will look like this:

{
    "updated_at": "2018-07-20T05:57:40.510Z",
    "indexed_at": "2018-07-20T05:57:40.510Z",
    "foo": "bar"
}

And you can use the exact same call when updating the document:

POST test_pipelines/doc/1/_update
{
  "script": {
    "id": "update-doc",
    "params": {
      "foo": "bor"             <--- only this changes
    }
  },
  "scripted_upsert": true,
  "upsert": {}
}

The updated document will look like this, exactly what you wanted:

{
    "updated_at": "2018-07-20T05:58:42.825Z",
    "indexed_at": "2018-07-20T05:57:40.510Z",
    "foo": "bor"
}
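Since the stored-script approach also works with bulk, the same update can be sent through the _bulk endpoint. A sketch, reusing the same index, type, and id as above:

POST _bulk
{ "update": { "_index": "test_pipelines", "_type": "doc", "_id": "1" } }
{ "script": { "id": "update-doc", "params": { "foo": "bor" } }, "scripted_upsert": true, "upsert": {} }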
Val
  • Thanks for the detailed response @Val. I will try the update-script approach. On the other hand, I've tried with `update_by_query` and the pipelines are working as I expected; as I understand it, `update_by_query` gets the whole _source and reindexes with it, which makes the pipeline deal with the date as expected. – jordivador Jul 20 '18 at 12:36
  • Cool, yes update by query would work, too, though not ideal for updating a single document – Val Jul 20 '18 at 12:47