I am planning on extracting (essentially scraping, with permission) some data from a web page and storing it in Elasticsearch (you know, for search).
While I have permission to scrape the data from the site,
- there is no API or another structured source for this data
- it's manually authored straight into HTML
- there are no unique identifiers that differentiate one entry from another (I will essentially be extracting around 1,000-5,000 entries from the DOM).
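Since there are no natural identifiers, one idea I've been toying with is deriving a deterministic ID by hashing each entry's extracted content, so an unchanged entry maps to the same document ID on every run (the `entry_id` helper below is my own invention, just to illustrate — it would only catch exact duplicates, not edited entries):

```python
import hashlib

def entry_id(entry: dict) -> str:
    """Derive a deterministic ID from an entry's extracted fields.

    Same content -> same ID across runs; any field change -> a new ID.
    So this detects unchanged entries, but an edited entry still looks
    like a brand-new one.
    """
    # Sort keys so the field order of the scraped dict doesn't matter
    canonical = "|".join(f"{k}={entry[k]}" for k in sorted(entry))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

a = entry_id({"title": "Widget", "desc": "A thing"})
b = entry_id({"desc": "A thing", "title": "Widget"})  # same content, reordered
assert a == b
```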
When I store this in ES, I am planning to put it all into one index under a single mapping type, say `thing`.
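For loading, I'd expect to batch the extracted entries through the `_bulk` API; a minimal sketch of building that newline-delimited payload (index and type names are just the ones from my example):

```python
import json

def bulk_lines(index: str, doc_type: str, docs: list) -> str:
    """Build the newline-delimited body for Elasticsearch's _bulk API:
    one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = bulk_lines("index-prod", "thing", [{"title": "Widget"}])
```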
However, over time, the source (the HTML web page) is likely to change as they add/remove/change content of some of these entries. Since there are no identifiers in the source, I can't easily identify new ones (and even worse, deleted ones or changed ones).
I want to keep my ES index up to date, and what I have in mind is some sort of blue-green mechanism:
- I run the extraction process at some schedule (daily/weekly) depending on the velocity of the source changing
- Every time it runs, the process produces another index (or it could be a new cluster altogether). Say the current index is `index-prod` and the new one built by the process is `index-rc` (release candidate)
- It validates `index-rc` based on some heuristics (a flexible velocity check on the number of entries, sample queries that we know should work, etc.)
- And if it's valid, it either:
- A. slowly flips queries into the new cluster/index
- or B. flips in one shot to the new cluster/index
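To make the validation step concrete, here's the sort of velocity check I have in mind — reject the release candidate if the entry count moved more than expected between runs (the function name and thresholds are placeholders I'd tune to the source's real churn rate):

```python
def counts_plausible(prod_count: int, rc_count: int,
                     max_drop: float = 0.10, max_growth: float = 0.50) -> bool:
    """Flexible velocity check: fail validation if the entry count
    dropped or grew more than the allowed fraction since the last run.

    A big drop likely means the extractor broke (e.g. the site's
    markup changed), not that the source really shed half its entries.
    """
    if prod_count == 0:
        return rc_count > 0  # first run: anything non-empty passes
    change = (rc_count - prod_count) / prod_count
    return -max_drop <= change <= max_growth

assert counts_plausible(1000, 1020)      # 2% growth: plausible
assert not counts_plausible(1000, 400)   # 60% drop: probably a broken scrape
```

Sample known-good queries would get a similar pass/fail check against `index-rc` before any flip.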
I am planning on hosting the cluster using AWS Elasticsearch Service, and I could probably concoct something using Route 53 CNAMEs (and maybe an ELB?), but I wanted to know whether there is more implicit support in Elasticsearch itself for doing this.
Essentially, I want to swap one index's data for another.
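For option B, the kind of one-shot swap I'm picturing would be something like repointing an alias (say `things`) from the old index to the new one in a single `_aliases` call — if aliases are indeed the "implicit support" I'm asking about, this is roughly what I'd expect the request body to look like:

```python
import json

# Body for POST /_aliases: both actions are applied in one request,
# so queries against the "things" alias should never see an empty
# or half-swapped index.
swap = {
    "actions": [
        {"remove": {"index": "index-prod", "alias": "things"}},
        {"add":    {"index": "index-rc",   "alias": "things"}},
    ]
}
body = json.dumps(swap)
# e.g. requests.post("http://localhost:9200/_aliases", data=body,
#                    headers={"Content-Type": "application/json"})
```

Queries would always target the `things` alias rather than a concrete index name, so the swap is invisible to clients.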