I am planning on extracting (essentially scraping, with permission) some data from a web page and storing it in Elasticsearch (you know, for search).
While I have permission to scrape the data from the site,
- there is no API or another structured source for this data
- it's manually authored straight into HTML
- there are no unique identifiers that differentiate one entry from another (I will essentially be extracting around 1,000-5,000 entries from the DOM).
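Since there are no natural identifiers, one idea I've been toying with is deriving a deterministic ID by hashing each entry's extracted content, so an unchanged entry maps to the same document ID on every run (the `entry_id` helper below is my own invention, just to illustrate — it would only catch exact duplicates, not edited entries):

```python
import hashlib

def entry_id(entry: dict) -> str:
    """Derive a deterministic ID from an entry's extracted fields.

    Same content -> same ID across runs; any field change -> a new ID.
    So this detects unchanged entries, but an edited entry still looks
    like a brand-new one.
    """
    # Sort keys so the field order of the scraped dict doesn't matter
    canonical = "|".join(f"{k}={entry[k]}" for k in sorted(entry))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

a = entry_id({"title": "Widget", "desc": "A thing"})
b = entry_id({"desc": "A thing", "title": "Widget"})  # same content, reordered
assert a == b
```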
When I store this in ES, I am planning to put it all into one index under a single mapping type, say `thing`.
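For loading, I'd expect to batch the extracted entries through the `_bulk` API; a minimal sketch of building that newline-delimited payload (index and type names are just the ones from my example):

```python
import json

def bulk_lines(index: str, doc_type: str, docs: list) -> str:
    """Build the newline-delimited body for Elasticsearch's _bulk API:
    one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = bulk_lines("index-prod", "thing", [{"title": "Widget"}])
```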
However, over time, the source (the HTML web page) is likely to change as they add/remove/change content of some of these entries. Since there are no identifiers in the source, I can't easily identify new ones (and even worse, deleted ones or changed ones).
I want to keep my ES index up to date, and what I have in mind is some sort of blue-green mechanism:
- I run the extraction process at some schedule (daily/weekly) depending on the velocity of the source changing
- Every time it runs, the process produces another index (or it could be a new cluster altogether). Say the current index is `index-prod` and the new one built by the process is `index-rc` (release candidate)
- It validates `index-rc` based on some heuristics (a flexible velocity check on the number of entries, sample queries that we know should work, etc.)
- And if it's valid, it either:
- A. slowly flips queries into the new cluster/index
- or B. flips in one shot to the new cluster/index
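To make the validation step concrete, here's the sort of velocity check I have in mind — reject the release candidate if the entry count moved more than expected between runs (the function name and thresholds are placeholders I'd tune to the source's real churn rate):

```python
def counts_plausible(prod_count: int, rc_count: int,
                     max_drop: float = 0.10, max_growth: float = 0.50) -> bool:
    """Flexible velocity check: fail validation if the entry count
    dropped or grew more than the allowed fraction since the last run.

    A big drop likely means the extractor broke (e.g. the site's
    markup changed), not that the source really shed half its entries.
    """
    if prod_count == 0:
        return rc_count > 0  # first run: anything non-empty passes
    change = (rc_count - prod_count) / prod_count
    return -max_drop <= change <= max_growth

assert counts_plausible(1000, 1020)      # 2% growth: plausible
assert not counts_plausible(1000, 400)   # 60% drop: probably a broken scrape
```

Sample known-good queries would get a similar pass/fail check against `index-rc` before any flip.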
I am planning on hosting the cluster using AWS Elasticsearch Service, and I could probably concoct something using Route 53 CNAMEs (and maybe an ELB?), but I wanted to know whether there is more implicit support in Elasticsearch itself for doing this.
Essentially, I want to swap one index's data for another.
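For option B, the kind of one-shot swap I'm picturing would be something like repointing an alias (say `things`) from the old index to the new one in a single `_aliases` call — if aliases are indeed the "implicit support" I'm asking about, this is roughly what I'd expect the request body to look like:

```python
import json

# Body for POST /_aliases: both actions are applied in one request,
# so queries against the "things" alias should never see an empty
# or half-swapped index.
swap = {
    "actions": [
        {"remove": {"index": "index-prod", "alias": "things"}},
        {"add":    {"index": "index-rc",   "alias": "things"}},
    ]
}
body = json.dumps(swap)
# e.g. requests.post("http://localhost:9200/_aliases", data=body,
#                    headers={"Content-Type": "application/json"})
```

Queries would always target the `things` alias rather than a concrete index name, so the swap is invisible to clients.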