
I have an index with millions of documents, and new documents are added to it periodically. I created an ingest pipeline for it, but I only want it to run on the new incoming documents, because the existing document count is huge.

I connected my index and ingest pipeline using _reindex like this:

POST _reindex
{
  "source": {
    "index": "index*"
  },
  "dest": {
    "index": "new_index",
    "pipeline": "pipeline"
  }
}

My current pipeline is as follows:

{
  "processors": [
    {
      "gsub": {
        "field": "my_field",
        "pattern": "regex",
        "replacement": ""
      }
    }
  ]
}

This ingest pipeline runs on every document in the index, but I only want it to process the new incoming data. How can I achieve this?

fehim

1 Answer


You don't need _reindex for this; that is precisely what runs the pipeline on all existing documents, which is what you want to avoid.

You simply need to configure your index with a default_pipeline setting:

PUT index*/_settings
{
   "index.default_pipeline": "pipeline"
}
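With this setting in place, every document indexed from now on goes through the pipeline automatically, while the millions of existing documents are left untouched. For example, indexing a new document into one of the matching indices (index name and field value are made up):

PUT index-2023/_doc/1
{
  "my_field": "some value"
}

The stored document will already have my_field processed by the gsub processor; no reindexing of old data is involved.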

UPDATE:

There's no feature in Elasticsearch that automatically triggers the indexing of a document into i2 when a document is indexed into i1. You can get close to what you expect using something like Logstash, which regularly polls an index (e.g. every minute) for documents that arrived during the last minute and sends them to a second index through your pipeline, but that's a solution outside of Elasticsearch:

input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "i1"
    schedule => "* * * * *"
    query => '{ "query": { "range": { "@timestamp": { "gt": "now-1m"} } } }'
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "i2"
    pipeline => "my_pipeline"
  }
}
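Before wiring up Logstash, you can check what the pipeline does to a document with the simulate API (the pipeline name matches the Logstash output above; the sample document is made up):

POST _ingest/pipeline/my_pipeline/_simulate
{
  "docs": [
    { "_source": { "my_field": "some value" } }
  ]
}

The response shows each document as it would look after the processors run, without actually indexing anything.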
Val
  • But this outputs the results into the existing index, right? I can't change the existing index because a lot of code uses it, and particularly that field. I need the output in a new index. – fehim Mar 21 '23 at 13:35
  • Where is the "new upcoming data" being indexed into? – Val Mar 21 '23 at 13:46
  • I've got an index with millions of documents, let's call it i1, and a new empty index, let's call it i2. i1 is the index I'm not allowed to touch, so no changes are allowed. I created an ingest pipeline, and whenever new documents are added to i1, I want the processed documents in i2. I couldn't find anything that accomplishes this. – fehim Mar 21 '23 at 13:51
  • 1) Shall this sync between i1 and i2 be run manually or automatically? 2) Is it possible that your indexing process feeds i1 AND i2 at the same time? – Val Mar 21 '23 at 14:12
  • 1) Yes, it should be performed automatically, if possible every time a document is inserted, but every second or every minute is OK too. 2) No, only i1 will feed the data to the ingest pipeline, and i2 will be on the receiving end. – fehim Mar 22 '23 at 11:54
  • OK, there's nothing in ES that triggers the indexing of a document in i2 based on the indexing of a document in i1. You can achieve something close to what you expect using, for instance, Logstash, which regularly polls an index and sends the new documents to a second index through your pipeline, but that's a solution outside of Elasticsearch. I've updated my answer accordingly. – Val Mar 22 '23 at 12:06