
I have an Oracle DB. Logstash retrieves data from Oracle and puts it into Elasticsearch.

But when Logstash runs its scheduled export every 5 minutes, Elasticsearch fills up with copies, because the old data is still there. This is expected: the Oracle data barely changes within those 5 minutes. Let's say 2-3 rows are added and 4-5 are deleted.

How can we replace the old data with the new data without creating copies?


For example:

  1. Delete the whole old index;
  2. Create a new index with the same name and the same configuration (nGram settings and mapping);
  3. Add all new data;
  4. Wait for 5 minutes and repeat.
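
For reference, here is a rough sketch of the kind of pipeline I am running (the driver path, connection string, credentials, table and index names below are placeholders, not my real ones):

```
input {
  jdbc {
    # Oracle JDBC driver and connection details (placeholders)
    jdbc_driver_library    => "/path/to/ojdbc8.jar"
    jdbc_driver_class      => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    jdbc_user              => "user"
    jdbc_password          => "password"
    # Full export of the table, run every 5 minutes
    statement => "SELECT * FROM my_table"
    schedule  => "*/5 * * * *"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "oracle-data"
    # No document_id is set, so every run indexes every row again
    # as a brand-new document with an auto-generated id.
  }
}
```
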
Kiryl A.
  • You need to use a unique id (like a userid) from your Oracle data and use that as the document id when ingesting into Elasticsearch. That way, if the same row in the database is updated, the corresponding document in Elasticsearch is overwritten with the new data. New rows are added as new documents anyway. BUT, this doesn't work when rows are deleted from Oracle, because Logstash can't read data that has been deleted. – ben5556 Nov 28 '18 at 19:37
  • Please note that setting the document's _id yourself will lead to a performance drop, as the balanced distribution (of documents across the shards in the given index) will not work very well. – ibexit Nov 28 '18 at 19:49
  • @ibexit is it documented somewhere or are you speaking from personal experience? Because the documentation says the `document_id` attribute is useful for overwriting existing entries in ES. See https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-document_id – ben5556 Nov 29 '18 at 02:42
  • @ben5556: Yes, this option is available in the ES output plugin and very handy in some cases. But nevertheless it's not good for your ES index performance. Please have a look at these: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/tune-for-indexing-speed.html#_use_auto_generated_ids and https://github.com/elastic/elasticsearch/issues/5941 and https://github.com/ppearcy/elasticflake and https://stackoverflow.com/questions/53214628/elasticsearch-how-does-sharding-affect-indexing-performance/53216210#53216210 – ibexit Nov 29 '18 at 10:41
  • @ibexit cheers good to know thanks for sharing – ben5556 Nov 29 '18 at 19:06
  • @ben5556 Hey! You said that if we declare some field in the index (for example, an id) as unique, then when Logstash runs it will simply replace the data that has the identical id. My question is: how can I declare a field as unique? – Kiryl A. Feb 05 '19 at 19:04
  • Hi, within the Logstash output config, use a column with unique values (like userid) as the value for document_id, so that if a row in the table gets updated, the corresponding document in Elasticsearch gets updated too (see the sketch just below these comments). Hope this helps – ben5556 Feb 10 '19 at 10:45
  • @ben5556 It is! Thank you! – Kiryl A. Feb 10 '19 at 16:26
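
A minimal sketch of the output configuration ben5556 describes above, assuming the table has a unique userid column (the column and index names are placeholders):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "oracle-data"
    # Reuse the row's unique key as the document id, so a re-imported row
    # overwrites its existing document instead of creating a duplicate.
    document_id => "%{userid}"
  }
}
```

As noted in the comments, this handles inserted and updated rows but not rows deleted from Oracle.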

2 Answers


It's pretty easy: create a new index for each import and apply the mappings, then switch your alias to the most recent index afterwards. Remove old indices if needed. Your current data will always be searchable while you index the most recent data.

Here are the sources you'll probably need to read:

  1. Use aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) to point to the most current data when searching in Elasticsearch (BTW, it's always a good idea to have aliases in place).

  2. Use the rollover API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html) to create a new index for each import run - note the alias handling here too.

  3. Use index templates (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html) to automatically apply the mappings/settings to your newly created indices.

  4. Shrink, close and/or delete old indices so your cluster only has to handle the data you really need. Have a look at Curator (https://github.com/elastic/curator) as a standalone tool. A sketch of these steps as REST calls follows below.
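
A minimal sketch of the steps above as REST calls (the index, alias, template and analyzer names and the nGram settings are placeholders, not taken from the question):

```
# 3. Index template, applied automatically to every index whose name matches
#    the pattern (put your nGram analysis settings and the mappings here)
PUT _template/oracle_data_template
{
  "index_patterns": ["oracle-data-*"],
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": { "type": "ngram", "min_gram": 2, "max_gram": 3 }
      },
      "analyzer": {
        "my_ngram_analyzer": { "type": "custom", "tokenizer": "my_ngram_tokenizer" }
      }
    }
  }
}

# 2. Create a fresh index for this import run (the rollover API can automate the
#    numbering) and let Logstash write into it, e.g. oracle-data-000002

# 1. Once the import has finished, switch the search alias to the new index
#    in one atomic step
POST _aliases
{
  "actions": [
    { "remove": { "index": "oracle-data-000001", "alias": "oracle-data" } },
    { "add":    { "index": "oracle-data-000002", "alias": "oracle-data" } }
  ]
}

# 4. Delete indices from older runs once they are no longer needed (e.g. via Curator)
DELETE oracle-data-000001
```

Because searches only ever go through the oracle-data alias, the switch from one index to the next is invisible to clients, and the previous index stays fully searchable until the swap.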

ibexit
  • Yes, I probably can't do without Curator. Nevertheless, it is very strange that the ES authors created Logstash and automated the data loading, but did not think of deleting old data. – Kiryl A. Nov 28 '18 at 19:57

You just need to use a fingerprint/hash of each document, or a hash of the unique fields in each document, as the document id, so that each time you can overwrite the same documents with the updated ones in place, while still adding new documents as well.

But this approach will not work for data that is deleted from Oracle.
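
A minimal sketch of this approach with the Logstash fingerprint filter (the column, field and index names are placeholders):

```
filter {
  fingerprint {
    # Hash only the column(s) that uniquely identify a row; hashing fields that
    # change on update would produce a new id instead of overwriting the old doc.
    source => ["userid"]
    method => "SHA256"
    # Keep the hash in @metadata so it is not indexed as part of the document.
    target => "[@metadata][fingerprint]"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    index       => "oracle-data"
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```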

Ijaz Ahmad
  • Please see my comment above. This applies here as well. You'll need a pretty good fingerprint/hash function in order to not break the document distribution, which is based on the document's id. – ibexit Nov 28 '18 at 21:02