
I have some questions about Apache Kafka and compacted topics. We want to provide PII data over a compacted Kafka topic and delete that data from the topic via tombstones. We would like to verify the following assumptions:

  1. Is there another company that fulfills the GDPR requirement (right to be forgotten) in Kafka through a compacted topic with tombstone generation, as KIP-354 proposes? https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Add+a+Maximum+Log+Compaction+Lag
  2. Is our assumption right that a record is only compacted once it is no longer in the active segment file? If so, we think section 4.8 of the Kafka documentation needs to be amended. It currently says: "The topic's max.compaction.lag.ms can be used to guarantee the maximum delay between the time a message is written and the time the message becomes eligible for compaction." It should add the condition that the message to be compacted must not be in the active segment file. Is this a bug in the max.compaction.lag.ms feature, or is it as designed? We are not sure at this point.
  3. Is compaction only triggered after a new message is written, or is there also an asynchronous process that compacts non-active segment files?

Thanks for your answers ;-)

holzleube

1 Answer


You are pretty much on point.

  1. Message deletion in a compacted Kafka topic is more or less the same as deleting a row in a database. It just doesn't happen immediately after the tombstone message is sent.
  2. Yes, the active log segment is not compacted. If you want to accelerate the compaction process for this particular topic (in order to satisfy point 1), you can reduce the maximum segment size (segment.bytes, defaults to 1GB) and the maximum segment age (segment.ms, defaults to 604800000 = 1 week) to some lower values, e.g. 100MB and 1. You should also look into min.cleanable.dirty.ratio and set it to a more aggressive value, again depending on the requirements (point 1).
  3. Compaction happens asynchronously, and it doesn't matter whether any messages were sent after the tombstone. There is a component running on each Kafka broker, the LogCleaner, which is responsible for that.
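
As a rough illustration of point 2, the topic overrides could be collected and rendered into a kafka-configs.sh call, as in the sketch below. The concrete values (hourly segment roll, dirty ratio of 0.1, one-day compaction lag), the topic name and the bootstrap address are assumptions for illustration only, not recommendations; the command is only built, not executed.

```python
# Hypothetical values: tune them to your own deletion deadline and traffic.
COMPACTION_OVERRIDES = {
    "cleanup.policy": "compact",
    "segment.bytes": 100 * 1024 * 1024,             # 100 MB instead of the 1 GB default
    "segment.ms": 60 * 60 * 1000,                   # assumed: roll at least hourly
    "min.cleanable.dirty.ratio": 0.1,               # assumed: clean earlier than the 0.5 default
    "max.compaction.lag.ms": 24 * 60 * 60 * 1000,   # assumed: 1 day
}


def to_add_config(overrides):
    """Render overrides as the comma-separated key=value list used by --add-config."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


def kafka_configs_command(topic, bootstrap_server, overrides):
    """Build the argument list for kafka-configs.sh (built only, never run here)."""
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--entity-type", "topics",
        "--entity-name", topic,
        "--alter",
        "--add-config", to_add_config(overrides),
    ]


if __name__ == "__main__":
    # "pii-topic" and "localhost:9092" are placeholder names.
    print(" ".join(kafka_configs_command("pii-topic", "localhost:9092", COMPACTION_OVERRIDES)))
```

Smaller segments and a lower dirty ratio make the cleaner run more often, so there is a trade-off between deletion latency and broker I/O.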
Martin Ivanov
  • Hi Martin, thanks for your answer. I have a follow-up question. You say that compaction happens asynchronously and that it does not matter whether a new message is written. In our experience, compaction in the log cleaner is only triggered after a message is written. Do you have any links to the Kafka source? – holzleube Sep 16 '20 at 06:43
  • We found the method maybeRoll: https://github.com/apache/kafka/blob/ef96ac07f565a73e35c5b0f4c56c8e87cfbaaf59/core/src/main/scala/kafka/log/Log.scala#L1843 – holzleube Sep 16 '20 at 07:01
  • maybeRoll decides if a new log segment should be rolled. Check point 2 in my original answer. – Martin Ivanov Sep 17 '20 at 08:09
  • The third point is not literally true, check https://stackoverflow.com/questions/70305319/is-it-possible-to-force-the-cleaner-to-compact-a-partition-log-for-partitions-wi – Whimusical Feb 24 '22 at 22:28
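
The interaction discussed in these comments can be sketched with a toy model. It is not Kafka's implementation: the class and method names are hypothetical, tombstones are dropped immediately (real Kafka keeps them for delete.retention.ms), and it assumes, as the maybeRoll link suggests, that the roll check only runs as part of an append. Under those assumptions the cleaner runs independently, but a tombstone only takes effect once a later write has rolled the segment that holds it.

```python
class ToyPartition:
    """Toy model of one compacted partition. Not Kafka's implementation."""

    def __init__(self, segment_ms):
        self.segment_ms = segment_ms
        self.closed = []            # records in rolled (closed) segments
        self.active = []            # records in the active segment
        self.active_created = None  # creation time of the active segment

    def append(self, key, value, now_ms):
        # Assumption from the thread: the roll check runs as part of an append.
        if self.active and now_ms - self.active_created >= self.segment_ms:
            self.closed.extend(self.active)
            self.active = []
        if not self.active:
            self.active_created = now_ms
        self.active.append((key, value))

    def clean(self):
        """Background cleaner pass: only closed segments are touched."""
        latest = {}
        for key, value in self.closed:
            latest[key] = value
        # Keep the newest value per key; drop keys whose newest record is a
        # tombstone (None). Simplification: no tombstone retention period.
        self.closed = [(k, v) for k, v in latest.items() if v is not None]

    def records(self):
        return self.closed + self.active


if __name__ == "__main__":
    p = ToyPartition(segment_ms=1000)
    p.append("user-1", "pii", now_ms=0)
    p.append("user-1", None, now_ms=10)   # tombstone, still in the active segment
    p.clean()
    print(p.records())                    # PII record and tombstone both remain
    p.append("user-2", "x", now_ms=5000)  # this write rolls the old segment
    p.clean()
    print(p.records())                    # only the user-2 record remains
```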