4

I am new to Kafka. We are trying to import data from a CSV file into Kafka. We need to import every day, and in the meantime the previous day's data becomes deprecated. How can I remove all messages under a Kafka topic in Python? Or how can I remove the Kafka topic itself in Python? I also saw someone suggest waiting for the data to expire; how can I set the data expiration time, if that's possible? Any suggestions will be appreciated!

Thanks

lucky_start_izumi

2 Answers

5

You cannot delete individual messages in a Kafka topic. You can:

  • Set the log.retention.* properties, which control when messages expire. You can choose either time-based expiration (e.g. keep messages that are six hours old or newer) or space-based expiration (e.g. keep at most 1 GB of messages). See the Broker config documentation and search for "retention". You can set different values for different topics.
  • Delete the whole topic. It's kind of tricky and I don't recommend it.
  • Create a new topic for every day, something like my-topic-2015-09-21.
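
As a rough illustration of the retention and topic-per-day options, here is a minimal sketch that builds a dated topic name and a `kafka-topics` create command with a per-topic `retention.ms` override. The binary path, the ZooKeeper address, the partition and replication defaults, and all helper names are assumptions for illustration; the command follows the same ZooKeeper-based `kafka-topics` style that appears elsewhere in this thread. It only builds the argument list and does not run anything.

```python
from datetime import date, timedelta

# Assumed locations; adjust for your installation.
KAFKA_TOPICS = "/usr/bin/kafka-topics"
ZOOKEEPER = "zookeeper-1:2181"


def daily_topic(base, day):
    """Return a per-day topic name such as my-topic-2015-09-21."""
    return f"{base}-{day.isoformat()}"


def retention_ms(hours):
    """Convert a retention period in hours to milliseconds."""
    return int(timedelta(hours=hours).total_seconds() * 1000)


def create_topic_command(topic, retention_hours, partitions=1, replication_factor=1):
    """Build (but do not run) a kafka-topics command with a topic-level retention."""
    return [
        KAFKA_TOPICS, "--zookeeper", ZOOKEEPER,
        "--create", "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
        "--config", f"retention.ms={retention_ms(retention_hours)}",
    ]


topic = daily_topic("my-topic", date(2015, 9, 21))
command = create_topic_command(topic, retention_hours=24)
```

The resulting list can be passed to `subprocess.call` once you are happy with the values.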

But I don't think you need to delete the messages in the topic at all, because your Kafka consumer keeps track of the messages that have already been processed. So once you have read all of today's messages, the consumer saves its position, and tomorrow you will read only the new messages.
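
To make the offset idea concrete, here is a toy in-memory model. It is not the Kafka client API; the `ToyTopic` class and its methods are made up for illustration. Each consumer group remembers the next offset to read, so old messages stay in the log but are not delivered again.

```python
class ToyTopic:
    """In-memory stand-in for a single-partition topic (hypothetical, not Kafka)."""

    def __init__(self):
        self.log = []          # every message ever sent, in order
        self.committed = {}    # group id -> next offset to read

    def send(self, message):
        self.log.append(message)

    def poll(self, group_id):
        """Return messages this group has not seen yet and commit the new position."""
        start = self.committed.get(group_id, 0)
        batch = self.log[start:]
        self.committed[group_id] = len(self.log)
        return batch


topic = ToyTopic()
topic.send("day1-row1")
topic.send("day1-row2")
first_read = topic.poll("importer")    # both day-1 rows

topic.send("day2-row1")
second_read = topic.poll("importer")   # only the day-2 row
```

The day-1 rows are still stored in `topic.log` after the second poll; they are simply skipped because the group's committed offset has moved past them.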

Another possible solution could be log compaction, but it's more complicated and probably not what you need. Basically, you can set a key for every message in the Kafka topic. If you send two different messages with the same key, Kafka will eventually keep just the newest message and delete all older messages with that key. You can think of it as a kind of "key-value store": every message with the same key just updates the value under that key. But hey, you really don't need this, it's just FYI :-).
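
The effect of compaction can be sketched with a plain Python function. This only simulates the outcome described above (newest record per key survives); it is not how Kafka implements it, and the `compact` helper is a hypothetical name.

```python
def compact(log):
    """Keep only the newest (key, value) record per key, in original log order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        # A later record with the same key replaces the earlier one.
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]


log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2")]
compacted = compact(log)
```

Here the first `user-1` record is dropped because a newer record with the same key exists.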

Lukáš Havrlant
  • Thanks so much. May I ask, is it possible to set the retention for each topic? Different topics may need different expiration times. – lucky_start_izumi Sep 21 '15 at 20:25
  • Yes, you can do it when you create a new topic, see [`--config` option in Topic-level configuration](http://kafka.apache.org/documentation.html#topic-config). – Lukáš Havrlant Sep 21 '15 at 20:30
  • May I ask you another question? For SimpleProducer, my setting is `producer = SimpleProducer(kafka_client, async=True, batch_send_every_n=batch_size, batch_send_every_t=60, async_retry_limit=5)` – lucky_start_izumi Sep 21 '15 at 22:50
  • Maybe you should submit a new post instead of a comment ;) – Lukáš Havrlant Sep 22 '15 at 06:23
2

The simplest approach is to delete the topic. I use this in Python automated test suites, where I want to verify that a specific set of test messages gets sent through Kafka and I don't want to see results from previous test runs.

from subprocess import call

def delete_kafka_topic(topic_name):
    # Shell out to the kafka-topics CLI to delete the given topic.
    call(["/usr/bin/kafka-topics", "--zookeeper", "zookeeper-1:2181", "--delete", "--topic", topic_name])
clay