Requirements:
- A single Elasticsearch index needs to be constructed from a set of flat files that is dropped every week
- Apart from this weekly feed, we also receive intermittent diff files carrying additional data that was not part of the original feed (inserts or updates only, no deletes)
- The time to parse and load these files (weekly full feed or diff files) into Elasticsearch is small
- The weekly feeds received in two consecutive weeks are expected to differ significantly (deletions, additions, updates)
- The index is critical for the apps to function and needs close to zero downtime
- We do not need to track the exact changes in a feed, but we must be able to roll back to the previous version if the current load fails for any reason
- To state the obvious, searches need to be fast and responsive
Given these requirements, we are planning to do the following:
- For incremental updates (diffs), we can insert or update records as-is using the bulk API
- For full updates, we will build a new index and swap the alias, as mentioned in this post. If a rollback is needed, we can revert to the previous working index (backups are also maintained in case the rollback needs to go back a few versions)
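To make the plan above concrete, here is a minimal sketch of the two request bodies involved: an NDJSON bulk body that upserts diff records (`doc_as_upsert` inserts the document when the id is new and merges fields otherwise), and an `_aliases` body that repoints the alias from the old index to the new one in a single atomic call. The index and alias names (`products_v1`, `products_v2`, `products`) are placeholders, not from the original plan.

```python
import json

def bulk_upsert_lines(index, docs):
    """Build the NDJSON body for a bulk upsert (insert-or-update, no deletes).
    `docs` is an iterable of (doc_id, document) pairs."""
    lines = []
    for doc_id, doc in docs:
        # An "update" action with doc_as_upsert inserts the document if the
        # id does not exist yet, otherwise merges the given fields into it.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    # A bulk body must be newline-delimited and end with a trailing newline.
    return "\n".join(lines) + "\n"

def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: both actions execute atomically, so the
    alias never points at zero indices and searches see no downtime."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Example payloads (hypothetical names):
bulk_body = bulk_upsert_lines("products_v2", [("1", {"name": "widget"})])
swap_body = alias_swap_body("products", "products_v1", "products_v2")
```

The bulk body would be sent to `POST /_bulk` with `Content-Type: application/x-ndjson`, and the swap body to `POST /_aliases`; rolling back is the same swap with the index names reversed.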
Questions:
- Is this the best approach, or is it better to CRUD documents against the previously created index (relying on the built-in versioning) when reconstructing an index?
- What is the impact of modifying data (deletes, updates) on the underlying Lucene indices/shards? Can modifications cause fragmentation or inefficiency?