2

Requirements:

  • A single Elasticsearch index needs to be constructed from a set of flat files that is dropped every week
  • Apart from this weekly feed, we also get intermittent diff files providing additional data that was not part of the original feed (inserts or updates, no deletes)
  • The time to parse and load these files (the weekly full feed or the diff files) into Elasticsearch is not significant
  • The weekly feeds received in two consecutive weeks are expected to differ significantly (deletes, additions, updates)
  • The index is critical for the apps to function, and it needs close to zero downtime
  • We are not concerned about the exact changes made in a feed, but we need the ability to roll back to the previous version in case the current load fails for some reason
  • To state the obvious, searches need to be fast and responsive

Given these requirements, we are planning to do the following:

  1. For incremental updates (diffs), we can insert or update records as-is using the bulk API
  2. For full updates, we will construct a new index and swap the alias as mentioned in this post. In case of a rollback, we can revert to the previous working index (backups are also maintained in case the rollback needs to go back a few versions)
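The two steps above can be sketched as request payloads. This is a minimal illustration, not the poster's actual code: the index names (`myindex_v1`, `myindex_v2`) and alias (`myindex`) are placeholders, and it assumes an Elasticsearch version where `_type` can be omitted from bulk action metadata. Step 1 uses `update` actions with `doc_as_upsert` (matching the diff semantics: insert or update, never delete); step 2 uses the `_aliases` endpoint, whose actions apply atomically, so clients never see a half-swapped state.

```python
import json


def bulk_upsert_lines(index, docs):
    """Build the NDJSON payload for the _bulk API: each document becomes
    an `update` action with `doc_as_upsert`, so it is inserted if new and
    merged into the existing document otherwise."""
    lines = []
    for doc_id, doc in docs.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"  # _bulk payloads must end with a newline


def alias_swap_body(alias, old_index, new_index):
    """Build the _aliases request body; both actions are applied in a
    single cluster-state update, making the swap atomic."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# POST bulk_upsert_lines(...) to /_bulk for a diff file, and
# POST alias_swap_body(...) to /_aliases after a full rebuild.
print(alias_swap_body("myindex", "myindex_v1", "myindex_v2"))
```

A rollback is just the same alias swap run in the opposite direction, pointing the alias back at the previous index.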

Questions:

  1. Is this the best approach, or is it better to CRUD documents in the previously created index using the built-in versioning when reconstructing an index?
  2. What is the impact of modifying data (deletes, updates) on the underlying Lucene indices/shards? Can modifications cause fragmentation or inefficiency?
user1452030

1 Answer

2
  1. At first glance, I'd say your overall approach is sound. Creating a new index every week with the new data and swapping an alias is a good approach if you need

    • zero downtime, and
    • to be able to roll back to previous indices for whatever reason

If you were to keep only one index and CRUD your documents in there, you would not be able to roll back if anything went wrong, and you could end up in a mixed state with data from the current week alongside data from the previous week.

  2. Every time you update (even a single field of) or delete a document, its previous version is flagged as deleted in the underlying Lucene segment. When the Lucene segments have grown sufficiently big, ES merges them and wipes out the deleted documents. However, in your case, since you're creating a new index every week (and eventually deleting the index from the prior week), you won't run into a situation where you have space and/or fragmentation issues.
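If you ever did keep a single long-lived index and accumulated many deletes, the merge that reclaims that space can also be triggered explicitly. A hedged sketch of building such a request, assuming ES 2.1+ where the endpoint is `_forcemerge` (older versions called it `_optimize`); the host and index name are placeholders:

```python
from urllib.parse import urlencode


def forcemerge_url(host, index, only_expunge_deletes=True):
    """Build the URL for a force-merge request that asks Lucene to merge
    segments and drop documents flagged as deleted, rather than doing a
    full merge down to one segment."""
    params = urlencode({"only_expunge_deletes": str(only_expunge_deletes).lower()})
    return f"http://{host}/{index}/_forcemerge?{params}"


# POST to this URL (no body needed); merging is I/O-intensive,
# so it is best run off-peak.
print(forcemerge_url("localhost:9200", "myindex"))
# → http://localhost:9200/myindex/_forcemerge?only_expunge_deletes=true
```

In the alias-swap setup from the question this is rarely needed, since each weekly index is written once and then dropped.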
Val
  • His data is bound to be similar from week to week; would it be a very bad idea to create a type per week (e.g. week1, week2, etc.) instead of recreating the full index every time? – Adonis Jul 26 '17 at 06:57
  • 1
    @asettouf See this other answer: https://stackoverflow.com/questions/45204579/elasticsearch-index-vs-type-and-handling-updates/45204724#45204724 (hint: types are going away) – Val Jul 26 '17 at 07:02
  • Thanks (again) Val :) The question on fragmentation was for the scenario where CRUD is performed on an existing index. In other words, I was trying to see if there were additional drawbacks in sticking to a single index and doing bulk updates and deletes on that (making ES work hard to keep up with data organization, while users are pounding the index for data) – user1452030 Jul 26 '17 at 12:23
  • 1
    When using a single index, it's harder to guarantee that your data is always consistent, especially if your weekly batch update crashes in the middle. Anyway, assuming your weekly update always succeeds, fragmentation will probably never be an issue. But again, [it depends](https://www.elastic.co/blog/it-depends) on how many documents we're talking about, your hardware spec, etc. – Val Jul 26 '17 at 12:28