
We use an ELK stack for our logging, and I've been asked to design a process for removing sensitive information that has been logged accidentally.

Based on my reading of how Elasticsearch (Lucene) handles deletes and updates, the data is still in the index, just no longer retrievable. It will ultimately get cleaned up as segments get merged, etc.

Is there a process to run an update (to redact something) or delete (to remove something) and guarantee its removal?

baynezy

1 Answer


When updating or deleting some value, ES will mark the current document as deleted and index the new document. The deleted value will still be present in the index, but will never be returned by a search. Granted, if someone gets access to the underlying index files, they might be able to use some tool (Luke or similar) to view what's inside the index files and potentially see the deleted sensitive data.
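To illustrate (using a hypothetical index name, myindex), the cat segments API shows how many deleted-but-not-yet-purged documents each segment still holds; a non-zero docs.deleted column means that data is still sitting on disk:

GET /_cat/segments/myindex?v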

The only way to guarantee that the documents marked as deleted are really removed from the index segments is to force a merge of the existing segments:

POST /myindex/_forcemerge?only_expunge_deletes=true

Be aware, though, that there is a setting called index.merge.policy.expunge_deletes_allowed that defines a threshold below which the force merge doesn't happen. By default this threshold is set at 10%, so if a segment contains fewer than 10% deleted documents, the force merge call won't touch it. You might need to lower the threshold in order for the deletion to happen... or, easier still, make sure not to index sensitive information in the first place.
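Putting it together, a redaction pass might look like the following sketch. The index name, field, query, and placeholder text are all hypothetical; _update_by_query rewrites the matching documents (marking the old versions as deleted), and the force merge then purges them from the segments:

# redact the sensitive field in all matching documents
POST /myindex/_update_by_query
{
  "query": { "match": { "message": "supersecret" } },
  "script": {
    "lang": "painless",
    "source": "ctx._source.message = '[REDACTED]'"
  }
}

# temporarily drop the expunge threshold so the merge always runs
PUT /myindex/_settings
{
  "index.merge.policy.expunge_deletes_allowed": 0
}

# physically purge the deleted documents from the segments
POST /myindex/_forcemerge?only_expunge_deletes=true

Afterwards you can restore the default threshold by putting the setting back to null.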

Val
  • what happens if we don't specify "max_num_segments" or "expunge_deletes_allowed"? Suppose an index has a thousand segments: what's the default behaviour here, how many segments will it have after the merge? – Yash Tandon Sep 16 '22 at 10:18
  • @YashTandon If you don't specify anything, ES defaults to checking if a merge needs to execute, and if so, executes it. – Val Sep 16 '22 at 10:56
  • I mean, I'm unsure whether it will merge everything into a single large segment, or merge into a few smaller merged segments? – Yash Tandon Sep 16 '22 at 11:32
  • It will never create a single segment if you don't specify it explicitly via `max_num_segments=1`. By default it tries to halve the number of existing segments – Val Sep 16 '22 at 11:35
  • Also, is there a maximum size limit that could be set to make sure that large segments are not created during a merge? In my index I have multiple segments of approximately 4-5 GB each, and they no longer get merged automatically because of their size, so deleted docs have accumulated in them. I posted my query here: https://stackoverflow.com/q/73742366/11432290 – Yash Tandon Sep 16 '22 at 12:02
  • There are a few [undocumented settings](https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/index/MergePolicyConfig.java) such as `index.merge.policy.max_merged_segment` that allow you to change that; 5gb is the default configuration value (see the sketch after these comments). – Val Sep 16 '22 at 12:09
  • Could you please look into my issue and suggest a suitable approach I should consider, as this involves a production ES cluster – Yash Tandon Sep 16 '22 at 12:12
  • @Val what happens if my segment sizes are more than 5GB, say some 7GB, and I call forcemerge. Does the forcemerge pick the 7GB segment for merging, or ignore it? – Invictus Feb 22 '23 at 06:25
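For completeness, a sketch of raising that segment-size ceiling on an index (the index name and the 10gb value are hypothetical, and since the setting is undocumented, it's worth testing on a non-production index first):

# allow merged segments to grow larger than the 5gb default
PUT /myindex/_settings
{
  "index.merge.policy.max_merged_segment": "10gb"
}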