Adding a new document to a separate index using Elasticsearch processors

Question

Is there a way to populate a separate index when I index some document(s)?

Let's assume I have something like:

PUT person/_doc/1
{
  "name": "John Doe",
  "languages": ["english", "spanish"]
}

PUT person/_doc/2
{
  "name": "Jane Doe",
  "languages": ["english", "russian"]
}

What I want is that every time a person is added, their languages are added to a languages index.

Something like:

GET languages/_search

would give:

...
"hits" : [
  {
    "_index" : "languages",
    "_type" : "doc",
    "_id" : "russian",
    "_score" : 1.0,
    "_source" : {
      "value" : "russian"
    }
  },
  {
    "_index" : "languages",
    "_type" : "doc",
    "_id" : "english",
    "_score" : 1.0,
    "_source" : {
      "value" : "english"
    }
  },
  {
    "_index" : "languages",
    "_type" : "doc",
    "_id" : "spanish",
    "_score" : 1.0,
    "_source" : {
      "value" : "spanish"
    }
  }
...

I was thinking of ingest pipelines, but I don't see any processor that allows such a thing.

Maybe the answer is to create a custom processor. I already have one, but I'm not sure how I could insert a document into a separate index from there.

Update: Using transforms as described in @Val's answer works, and seems to be the right answer indeed...

However, I am using Open Distro for Elasticsearch and transforms are not available there. An alternative solution that works there would be greatly appreciated :)

Update 2: Looks like OpenSearch is replacing Open Distro for Elasticsearch, and it has a transform API \o/

score 2 · Accepted Answer · edited Aug 19 '21 at 13:28

A document entering an ingest pipeline cannot be cloned or split the way it can in Logstash, for instance. So you cannot index two documents from a single one.

However, just after indexing your person documents, it's definitely possible to hit the _transform API endpoint and create the languages index from the person one:

First create the transform:

PUT _transform/languages-transform
{
  "source": {
    "index": "person"
  },
  "pivot": {
    "group_by": {
      "language": {
        "terms": {
          "field": "languages.keyword"
        }
      }
    },
    "aggregations": {
      "count": {
        "value_count": {
          "field": "languages.keyword"
        }
      }
    }
  },
  "dest": {
    "index": "languages",
    "pipeline": "set-id"
  }
}

You also need to create the pipeline that will set the proper ID for your language documents:

PUT _ingest/pipeline/set-id
{
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{language}}"
      }
    }
  ]
}

Then, you can start the transform:

POST _transform/languages-transform/_start

And when it's done you'll have a new index called languages whose content is

GET languages/_search
=>
"hits" : [
  {
    "_index" : "languages",
    "_type" : "_doc",
    "_id" : "english",
    "_score" : 1.0,
    "_source" : {
      "count" : 4,
      "language" : "english"
    }
  },
  {
    "_index" : "languages",
    "_type" : "_doc",
    "_id" : "russian",
    "_score" : 1.0,
    "_source" : {
      "count" : 2,
      "language" : "russian"
    }
  },
  {
    "_index" : "languages",
    "_type" : "_doc",
    "_id" : "spanish",
    "_score" : 1.0,
    "_source" : {
      "count" : 2,
      "language" : "spanish"
    }
  }
]

Note that you can also put that transform on a schedule so that it runs regularly, or run it manually whenever it suits you, to rebuild the languages index.
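
For the transform to pick up persons added later, it has to run in continuous mode, which Elasticsearch configures with a `sync` block and a `frequency`. Here is a minimal sketch in Python that builds such a request body, reusing the transform above. Note the assumptions: `indexed_at` is a hypothetical date field that the person documents would need to carry (the ones in the question have no date field), and the helper name `continuous_transform_body` is made up for illustration.

```python
def continuous_transform_body(time_field="indexed_at", frequency="1m", delay="60s"):
    """Build a body for PUT _transform/languages-transform in continuous mode.

    time_field is a hypothetical date field on the person documents; the
    documents in the question do not have one, so it would have to be added.
    """
    return {
        "source": {"index": "person"},
        "pivot": {
            "group_by": {
                "language": {"terms": {"field": "languages.keyword"}}
            },
            "aggregations": {
                "count": {"value_count": {"field": "languages.keyword"}}
            },
        },
        "dest": {"index": "languages", "pipeline": "set-id"},
        # How often the transform checks the source index for changes.
        "frequency": frequency,
        # Continuous mode: new or changed documents are detected via this date field.
        "sync": {"time": {"field": time_field, "delay": delay}},
    }
```

The resulting dictionary, serialized as JSON, would be sent as the body of the same `PUT _transform/languages-transform` request. As far as I know the `sync` setting has to be present when the transform is created, so an existing batch transform would likely need to be deleted and recreated.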

OpenSearch has its own _transform API. It works slightly differently; the transform can be created this way:

PUT _plugins/_transform/languages-transform
{
  "transform": {
    "enabled": true,
    "description": "Insert languages",
    "schedule": {
      "interval": {
        "period": 1,
        "unit": "minutes"
      }
    },
    "source_index": "person",
    "target_index": "languages",
    "data_selection_query": {
      "match_all": {}
    },
    "page_size": 1,
    "groups": [{
      "terms": {
        "source_field": "languages.keyword",
        "target_field": "value"
      }
    }]
  }
}

I am afraid we'll be using the "Open Distro for Elasticsearch" flavor. `_transform` cannot be used as it is part of `X-Pack` :(. Unless I am missing something. — DavidEG, Aug 13 '21 at 14:03
Nonetheless, I have tested it in a machine where I have the original Elasticsearch and it seems to work... except if I later add new people with new languages, they are not added :-/ Tried to `POST _transform/languages-transform/_start`, also `stop`, after adding the new persons, didn't work. — DavidEG, Aug 13 '21 at 14:08
The post didn't mention Opendistro, but yes, it's only available in the official version of Elasticsearch — Val, Aug 13 '21 at 14:28
In order to pick up new people, you need to set that transform on `sync` in order to let it run on a regular schedule — Val, Aug 13 '21 at 14:32
Sync worked. Also realized Open distro is being replaced by OpenSearch, which has its own transform API. Edited answer to add OpenSearch solution and accepted. Looks like a correct answer both for Elasticsearch and OpenSearch. Thanks! — DavidEG, Aug 19 '21 at 13:30

score 0 · Answer 2 · answered Jul 27 '21 at 14:40

You just need to change the _index field in the ingest pipeline:

{
  "description": "reroutes matching documents to the languages index",
  "processors": [
    {
      "set": {
        "if": "[*your condition here*]",
        "field": "_index",
        "value": "languages",
        "override": true
      }
    }
  ]
}

Kaveh


From what I've just tested, that would allow me to reroute documents to the `languages` index. But if I do `GET person/_search`, the documents are not there. I want the original documents left untouched in the original index, with the extracted information going into the other index. – DavidEG Jul 28 '21 at 07:48
You can put both the language and person indexes in the value, like "language,person". – Kaveh Jul 28 '21 at 15:49
That doesn't seem to work. I get the error: `Invalid index name [person,languages], must not contain the following characters [ , \", *, \\, <, |, ,, >, /, ?]` – DavidEG Jul 29 '21 at 09:59
So in this case your best option would be to handle this logic in your application not in ES. – Kaveh Jul 29 '21 at 10:37
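
Handling this in the application, as the last comment suggests, could look roughly like the sketch below. It is only an illustration under some assumptions: the function name `build_bulk_actions` is made up, and the action dictionaries use the `_index` / `_id` / `_source` shape that the Python clients' `helpers.bulk` accepts. Each language uses its own name as `_id`, so indexing it again overwrites the existing document instead of creating a duplicate.

```python
def build_bulk_actions(people):
    """Return bulk actions that index each person and each language seen.

    `people` maps a document ID to a person document with a "languages" list.
    """
    actions = []
    seen = set()
    for person_id, person in people.items():
        # The person document goes to its own index, unchanged.
        actions.append({"_index": "person", "_id": person_id, "_source": person})
        for language in person.get("languages", []):
            if language in seen:
                continue  # already queued in this batch
            seen.add(language)
            # Language name as _id: re-indexing overwrites rather than duplicates.
            actions.append(
                {"_index": "languages", "_id": language, "_source": {"value": language}}
            )
    return actions


people = {
    "1": {"name": "John Doe", "languages": ["english", "spanish"]},
    "2": {"name": "Jane Doe", "languages": ["english", "russian"]},
}
actions = build_bulk_actions(people)
```

The `actions` list could then be passed to something like `helpers.bulk(client, actions)`. Since this needs neither transforms nor X-Pack, it should work the same way on Elasticsearch, Open Distro and OpenSearch.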
