I have an Elasticsearch index with multiple documents. I now want to update the index with some new documents, which might include duplicates of the existing ones. What's the best way to do this? I'm using elasticsearch-py for all CRUD operations.
1 Answer
Every update in Elasticsearch deletes the old document and creates a new one. The smallest unit of document storage in Elasticsearch is the segment, and segments are immutable; when you index a new document or update an existing one, it is written to a new segment, and segments are merged into bigger segments during the merge process.
So even if you index duplicate data, as long as it has the same id it will simply replace the existing document. That's fine, and it's more performant to just index whatever comes in and let Elasticsearch overwrite it, rather than first fetching the document, comparing both versions to see if they are duplicates, and then discarding the update/upsert request in your application. Without a fixed id, though, Elasticsearch will insert the duplicate docs again.
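For example, if you can derive a stable id from the document content itself, re-indexing the same data just overwrites the existing document instead of creating a duplicate. A minimal sketch, assuming an index name and field names chosen for illustration; the `es.index` call is shown commented out since it needs a running cluster:

```python
import hashlib
import json

def doc_id(doc: dict) -> str:
    """Derive a deterministic id from the document content.

    Serialising with sorted keys makes the hash independent of key
    order, so the same logical document always maps to the same id.
    """
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

doc = {"title": "hello", "body": "world"}

# With elasticsearch-py, indexing with an explicit id replaces any
# existing document that has the same id (no duplicate is created):
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   es.index(index="my-index", id=doc_id(doc), document=doc)

# The same content always yields the same id, regardless of key order:
assert doc_id(doc) == doc_id({"body": "world", "title": "hello"})
```

Indexing the same payload twice then hits the same id both times, so the second write is an overwrite rather than a new document.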

Amit
- We aren't storing the ids, and that's where the challenge lies. Without a fixed id, I would end up with duplicates of the same record. Any other ways to get around this? – David Sep 28 '20 at 09:48
- @David you gotta create a unique ID, e.g. a hash of your text field. This way the document gets rewritten each time it is updated. – winwin Jun 06 '23 at 12:40
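Following that suggestion, a sketch of bulk indexing with content-hash ids; the field name `text` and the index name are assumptions, and the `helpers.bulk` call is commented out because it needs a live cluster:

```python
import hashlib

def text_id(text: str) -> str:
    # Hash the text field so identical texts map to the same _id.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

docs = [
    {"text": "first record"},
    {"text": "second record"},
    {"text": "first record"},  # duplicate of the first doc
]

# Build bulk actions with deterministic _id values; the duplicate
# collapses onto the same _id, so Elasticsearch overwrites rather
# than inserting it twice.
actions = [
    {"_index": "my-index", "_id": text_id(d["text"]), "_source": d}
    for d in docs
]

# With elasticsearch-py this would be sent as:
#
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")
#   helpers.bulk(es, actions)

assert len({a["_id"] for a in actions}) == 2  # three docs, two distinct contents
```

Note that this only deduplicates documents whose `text` field is byte-for-byte identical; near-duplicates would still get distinct ids.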