
I want to implement data replay for some use cases we have, and for that I need to rely on Kafka's retention policy (I am using a join, and I need the window time to be accurate). P.S. I am using Kafka version 0.10.1.1

I am sending data into the topic like this:

kafkaProducer.send(
        new ProducerRecord<>(kafkaTopic, 0, (long) r.get("date_time"), r.get(keyFieldName).toString(), r)
);

And I create my topic like this:

kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic myTopic
kafka-topics --zookeeper localhost:2181 --alter --topic myTopic --config retention.ms=172800000
kafka-topics --zookeeper localhost:2181 --alter --topic myTopic --config segment.ms=172800000
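To confirm the overrides actually took effect, the topic's configuration can be inspected with the same CLI (output format varies slightly by Kafka version):

```shell
kafka-topics --zookeeper localhost:2181 --describe --topic myTopic
```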

With the settings above, the retention time of my topic should be 48 hours.
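As a quick sanity check on the arithmetic, 172800000 ms is exactly 48 hours (the helper method below is just for illustration):

```java
public class RetentionMath {
    static long hoursToMillis(long hours) {
        return hours * 60 * 60 * 1000;
    }

    public static void main(String[] args) {
        // retention.ms=172800000 corresponds to 48 hours.
        System.out.println(hoursToMillis(48)); // 172800000
    }
}
```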

I implement TimestampExtractor in order to log the actual timestamp of each message.

import java.util.Date;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ConsumerRecordOrWallclockTimestampExtractor implements TimestampExtractor {
    private static final Logger LOG = LoggerFactory.getLogger(ConsumerRecordOrWallclockTimestampExtractor.class);

    @Override
    public long extract(ConsumerRecord<Object, Object> consumerRecord) {
        LOG.info("TIMESTAMP : " + consumerRecord.timestamp() + " - Human readable : " + new Date(consumerRecord.timestamp()));
        // Use the record's embedded timestamp when it is valid; otherwise fall back to wall-clock time.
        return consumerRecord.timestamp() >= 0 ? consumerRecord.timestamp() : System.currentTimeMillis();
    }
}
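The fallback logic in extract() can be checked in isolation. This is a minimal sketch (the resolveTimestamp helper is hypothetical, introduced only for illustration) showing the intended behavior:

```java
public class TimestampFallback {
    // Mirrors the extractor's fallback: keep a valid (>= 0) embedded
    // timestamp, otherwise substitute the supplied wall-clock time.
    static long resolveTimestamp(long recordTimestamp, long wallClockMillis) {
        return recordTimestamp >= 0 ? recordTimestamp : wallClockMillis;
    }

    public static void main(String[] args) {
        // A valid embedded timestamp is kept as-is.
        System.out.println(resolveTimestamp(1488295086292L, 999L)); // 1488295086292
        // A negative (unset) timestamp falls back to wall-clock time.
        System.out.println(resolveTimestamp(-1L, 999L)); // 999
    }
}
```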

For testing, I sent 4 messages to my topic and got these 4 log messages.

2017-02-28 10:23:39 INFO ConsumerRecordOrWallclockTimestampExtractor:21 - TIMESTAMP : 1488295086292 - Human readable : Tue Feb 28 10:18:06 EST 2017
2017-02-28 10:24:01 INFO ConsumerRecordOrWallclockTimestampExtractor:21 - TIMESTAMP : 1483272000000 - Human readable : Sun Jan 01 07:00:00 EST 2017
2017-02-28 10:26:11 INFO ConsumerRecordOrWallclockTimestampExtractor:21 - TIMESTAMP : 1485820800000 - Human readable : Mon Jan 30 19:00:00 EST 2017
2017-02-28 10:27:22 INFO ConsumerRecordOrWallclockTimestampExtractor:21 - TIMESTAMP : 1488295604411 - Human readable : Tue Feb 28 10:26:44 EST 2017

So based on Kafka's retention policy, I expected two of my messages to be purged/deleted after 5 minutes (the 2nd and 3rd messages, since their timestamps are from Jan 1st and Jan 30th). But I kept consuming my topic for an hour, and every time I consumed it I got all 4 messages back.

kafka-avro-console-consumer --zookeeper localhost:2181 --from-beginning --topic myTopic

My Kafka config is like this:

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

Am I doing something wrong, or am I missing something here?

Am1rr3zA

1 Answer


Kafka implements its retention policy by deleting log segments. Kafka never deletes the active segment, which is the segment where it will append new messages sent to the partition. Kafka deletes only old segments. Kafka rolls the active segment into an old segment when a new message is sent to the partition, and either

  • the size of the active segment with the new message would exceed log.segment.bytes, or
  • the timestamp of the first message in the active segment is older than log.roll.ms (default is 7 days)

So in your example, you have to wait 7 days after Tue Feb 28 10:18:06 EST 2017, send a new message, and then all 4 initial messages will be deleted.
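The two roll conditions above can be sketched as follows. This is a simplified model for illustration only, not Kafka's actual broker code; the method name and parameters are assumptions:

```java
public class SegmentRollCheck {
    // Models when the broker rolls the active segment on append of a new message:
    // either the segment would exceed its size limit, or its first message is
    // older than the roll interval.
    static boolean shouldRoll(long activeSegmentBytes, long newMessageBytes, long segmentBytesLimit,
                              long firstMessageTimestampMs, long nowMs, long rollMs) {
        boolean sizeExceeded = activeSegmentBytes + newMessageBytes > segmentBytesLimit;
        boolean tooOld = nowMs - firstMessageTimestampMs > rollMs;
        return sizeExceeded || tooOld;
    }

    public static void main(String[] args) {
        long sevenDaysMs = 7L * 24 * 60 * 60 * 1000;
        // Segment is small and its first message is recent: no roll,
        // so the active segment (and everything in it) is never deleted.
        System.out.println(shouldRoll(1000, 100, 1073741824L, 0L, sevenDaysMs - 1, sevenDaysMs)); // false
        // First message older than log.roll.ms: the segment is rolled
        // and becomes eligible for retention-based deletion.
        System.out.println(shouldRoll(1000, 100, 1073741824L, 0L, sevenDaysMs + 1, sevenDaysMs)); // true
    }
}
```

This is why retention.ms alone did not purge the old messages: all 4 messages still sit in the active segment, which is exempt from deletion.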

Chin Huang
    If so, how does that explain the fact that when I sent two messages with a timestamp from 1970 (very old messages), both got deleted after 5 mins? – Am1rr3zA Feb 28 '17 at 18:58