
I understand that Elasticsearch only marks documents as deleted and does not reclaim the disk space. To reclaim it, you need a force merge: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html#indices-forcemerge

But there are warnings against the use of this call that speak of all kinds of unthinkable doom if you use it.

However, GDPR compliance means documents must be deleted - really deleted, not just hidden. So you have to use this command sometimes, don't you? (I guess encrypting the data at rest mitigates this.)

But even if you ignore GDPR compliance your index will eventually fill your disk, won't it? Then what?

And if you do choose to use this command should you close your index first for performance considerations (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-open-close.html) and then re-open it when the operation has completed?

I'm relatively new to Elasticsearch so be gentle :-)

TVMIA,

Adam.

Adam Benson
  • `But even if you ignore GDPR compliance your index will eventually fill your disk, won't it? Then what?` No, because merging happens all the time in the background. Force merge is only useful when you are no longer writing to your index and want to merge everything into a single segment, e.g. for archiving purposes – Val Sep 14 '18 at 13:49
  • Thanks, Val - so basically, I don't need to worry about this at all. It's automagically taken care of. – Adam Benson Sep 14 '18 at 13:56
  • You might want to go through the last paragraph here as well: https://stackoverflow.com/questions/50986201/how-to-absolutely-delete-something-from-elasticsearch/50987159#50987159 – Val Sep 14 '18 at 14:04
  • Interesting. Would I be correct in assuming that setting that too low might adversely affect performance? Would ES perform a merge as soon as that threshold was reached? Or only at set intervals and if that threshold were reached? – Adam Benson Sep 14 '18 at 14:57
  • Segment merges happen all the time (behavior shown here: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html), read more here: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-merge.html – Val Sep 14 '18 at 15:13
  • Thanks again for your time. – Adam Benson Sep 14 '18 at 16:16
  • A related article worth reading: https://www.eivindarvesen.com/blog/2018/09/16/elasticsearch-and-gdpr – Val Sep 24 '18 at 04:51

1 Answer

My layman's conclusion re: GDPR compliance is the same as yours.

Segment merging happens in the background all the time, when certain conditions are met (like X% of documents in the segments have been marked for deletion). However, there can still be situations where deleted data stays on disk for longer than 30 days (the time frame within which the GDPR requires you to delete data), depending on your cluster architecture, data, etc.
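To see how much deleted-but-not-yet-merged data an index is carrying, you can compare the `docs.deleted` and `docs.count` figures from the index stats API (`GET /<index>/_stats/docs`). A rough sketch follows; the helper name, the index name and the sample response are made up for illustration, and no request is actually sent.

```python
def deleted_fraction(stats, index):
    """Fraction of documents in `index` that are marked deleted but not yet merged away.

    `stats` is the parsed JSON returned by GET /<index>/_stats/docs.
    """
    docs = stats["indices"][index]["primaries"]["docs"]
    live, deleted = docs["count"], docs["deleted"]
    total = live + deleted
    return deleted / total if total else 0.0


# Hypothetical, trimmed-down response; a real one carries many more fields.
sample = {
    "indices": {
        "people-v1": {"primaries": {"docs": {"count": 900, "deleted": 100}}}
    }
}

print(deleted_fraction(sample, "people-v1"))
```

A non-zero value here only tells you that deleted documents are still physically present in some segment; it does not tell you which ones or for how long.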

The solution here is to not use Elasticsearch as your primary data store. This is considered best practice. You should use an alias to point to your active index, regularly reindex your data from your source of truth into a new index, point the alias at the new index, and then delete the old index.
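The alias swap can be sketched as a short sequence of REST calls. This is only an outline under assumptions: the alias and index names are hypothetical, the step that bulk-loads data from your source of truth is left to your application, and the code only builds the requests rather than sending them.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: both actions are applied atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def rebuild_plan(alias, old_index, new_index):
    """Ordered (method, path, body) requests for one rebuild cycle."""
    return [
        # 1. Create the new index, then load it from the source of truth
        #    (the loading itself is application-specific and not shown).
        ("PUT", f"/{new_index}", None),
        # 2. Repoint the alias so searches hit the new index.
        ("POST", "/_aliases", alias_swap_body(alias, old_index, new_index)),
        # 3. Drop the old index, which removes its segment files entirely.
        ("DELETE", f"/{old_index}", None),
    ]


if __name__ == "__main__":
    for method, path, body in rebuild_plan("people", "people-v1", "people-v2"):
        print(method, path, json.dumps(body) if body else "")
```

Because the old index is deleted as a whole, anything that was only marked as deleted in its segments goes with it, without waiting for a merge.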

You might also look into changing the segment merge policy (depending on your use case).
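One setting that interacts with the force merge API from the question is `index.merge.policy.expunge_deletes_allowed`: a force merge with `only_expunge_deletes=true` only rewrites segments whose percentage of deleted documents exceeds that value (10 by default, per the force merge docs). A minimal sketch of the two requests, assuming the setting can be updated dynamically in your Elasticsearch version (check the docs for your release); the function and index names are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def expunge_requests(index, allowed_pct=0.0):
    """Build (method, path, body) requests that lower the threshold, then expunge deletes."""
    # Assumption: this index setting is accepted by the update-settings API.
    settings_body = {"index.merge.policy.expunge_deletes_allowed": allowed_pct}
    query = urlencode({"only_expunge_deletes": "true"})
    return [
        ("PUT", f"/{index}/_settings", settings_body),
        ("POST", f"/{index}/_forcemerge?{query}", None),
    ]


if __name__ == "__main__":
    for request in expunge_requests("people-v1"):
        print(request)
```

Lowering the threshold means more segments get rewritten, so expect extra I/O while the merge runs.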

If you're interested in more details, I wrote a blog post about this back in 2018, and spoke about it at JavaZone 2019.

  • Thanks for the response, Eivind. We don't use ES as our primary data store so we're OK there. The idea of re-indexing every 30 days is an interesting one - it would certainly work. Our ES server is fairly heavily protected and the data is encrypted at rest but the idea that you can enumerate even deleted documents is a bit uncomfortable! – Adam Benson Jan 13 '20 at 17:31
  • Having followed your link to your blog, I now see what you refer to regarding 30 days - specifically a subject delete request. Of course, the GDPR also requires that data is held for the minimum reasonable time possible, and after that point, it needs to be deleted just as fully as if the subject requested that process manually. Still, it's quite likely the steps that will allow the OP to be compliant are not the same as yours, unless you happen to be doing exactly the same thing, for exactly the same purpose, etc. – pjcard Sep 11 '20 at 17:58
  • "The solution here is to not use Elasticsearch as your primary data store. This is considered best practice" By whom? Elastic themselves have a whole trove of information on using their stack in a GDPR compliant way https://www.elastic.co/gdpr – pjcard Sep 11 '20 at 18:07
  • Their GDPR page doesn’t deal with deletion at all, as far as I can tell. It has been considered best practice for many years to not use ES as a primary data store, but traditionally more for resilience-related reasons. https://discuss.elastic.co/t/elasticsearch-2-3-as-primary-data-store/50265 – Eivind Arvesen Sep 12 '20 at 20:23