I want to know whether there is some way, other than by offset, to fetch data for a given time interval. Say I want to consume all the data from yesterday: how do I do it?
-
Two-step process: Find the offset ranges that correspond to your date range and then consume those (by offset). https://stackoverflow.com/questions/39514167/retrieve-timestamp-based-data-from-kafka – Thilo May 18 '18 at 07:01
-
@Thilo thanks for the comment. I did see that old thread and was wondering whether anything had changed. Does it mean I have to store offset details somewhere outside Kafka and query Kafka based on that? – Volatil3 May 18 '18 at 07:04
-
Not necessarily. Recent Kafka versions do include a timestamp on all messages. So no need for off-Kafka storage. – Thilo May 18 '18 at 07:05
-
@Thilo can you please help me find an example of accessing messages by timestamp? I am implementing this in Python, but a Java example will work too. I have not been able to find one myself yet. – Volatil3 May 18 '18 at 07:09
-
I'm not familiar with the Python API, but if you can access record metadata (offset, partition number) when you consume messages, then there should also be the timestamp. – Thilo May 18 '18 at 07:14
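To illustrate the point about record metadata: in kafka-python each consumed record carries a `timestamp` attribute in epoch milliseconds alongside `partition` and `offset`. The sketch below only shows the conversion to a readable time. `FakeRecord` and `record_time` are hypothetical names used so the snippet runs without a broker; a real record from a consumer would be passed in the same way.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical stand-in for a consumed record, so the sketch runs without a broker.
FakeRecord = namedtuple("FakeRecord", ["partition", "offset", "timestamp"])


def record_time(record):
    """Convert a record's epoch-millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(record.timestamp / 1000, tz=timezone.utc)


rec = FakeRecord(partition=0, offset=42, timestamp=1546300800000)
print(rec.partition, rec.offset, record_time(rec).isoformat())
```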
3 Answers
Use offsetsForTimes (offsets_for_times in kafka-python) to get the offset that corresponds to the required timestamp. In Python it looks like this:
from datetime import datetime
from kafka import KafkaConsumer, TopicPartition

topic = "www.kilskil.com"
broker = "localhost:9092"

# let's check the messages of the first day of the new year
date_in = datetime(2019, 1, 1)
date_out = datetime(2019, 1, 2)

consumer = KafkaConsumer(topic, bootstrap_servers=broker, enable_auto_commit=True)
consumer.poll()  # we need to read a message or call a dummy poll before seeking to the right position

tp = TopicPartition(topic, 0)  # partition no. 0
# in the simple case, without any special Kafka configuration, each topic has only one partition
# and its number is 0

# in fact you asked about how to use 2 methods: offsets_for_times() and seek()
# timestamps are passed as integer milliseconds
rec_in = consumer.offsets_for_times({tp: int(date_in.timestamp() * 1000)})
rec_out = consumer.offsets_for_times({tp: int(date_out.timestamp() * 1000)})

consumer.seek(tp, rec_in[tp].offset)  # let's go to the first message of the new year!
c = 0
for msg in consumer:
    if msg.offset >= rec_out[tp].offset:
        break
    c += 1
    # message also has .timestamp field
print("{c} messages between {_in} and {_out}".format(c=c, _in=str(date_in), _out=str(date_out)))
Don't forget that Kafka measures timestamps in milliseconds and stores them as a long. Python's datetime.timestamp() returns seconds, so we need to multiply by 1000. The offsets_for_times method returns a dict with TopicPartition keys and OffsetAndTimestamp values.
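For the "all of yesterday" case from the question, the millisecond conversion can be wrapped in a small helper. This is a minimal sketch, not part of the answer above: `to_epoch_millis` and `yesterday_range_ms` are hypothetical names, and the helper assumes timezone-aware datetimes, because a naive `datetime` is interpreted in the machine's local timezone by `timestamp()`.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Return integer epoch milliseconds for a timezone-aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Integer division of timedeltas avoids float rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def yesterday_range_ms(now):
    """Return (start_ms, end_ms) for the calendar day before `now`, in now's timezone."""
    today_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return to_epoch_millis(today_start - timedelta(days=1)), to_epoch_millis(today_start)


start_ms, end_ms = yesterday_range_ms(datetime(2019, 1, 2, 15, 30, tzinfo=timezone.utc))
print(start_ms, end_ms)
```

The two values form a half-open interval: `start_ms` is the first millisecond of yesterday and `end_ms` is the first millisecond of today, so they can be passed to `offsets_for_times` as the start and end lookups.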

-
Is there a way to handle the scenario where date_out is later than the timestamp of the latest offset? It looks like the code will fail in that case. – SureshCS Aug 11 '20 at 11:19
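One possible way to handle that case, as a sketch: kafka-python's `offsets_for_times` returns `None` for a partition when no record has a timestamp at or after the requested one, and `end_offsets` returns the current log end offset, which can serve as the fallback. `resolve_end_offset` is a hypothetical helper name; `consumer` is assumed to be a `KafkaConsumer` and `tp` a `TopicPartition`.

```python
def resolve_end_offset(consumer, tp, end_ms):
    """Return the first offset at or after end_ms, or the log end offset if there is none."""
    found = consumer.offsets_for_times({tp: end_ms}).get(tp)
    if found is not None:
        return found.offset
    # No record has a timestamp >= end_ms, so the interval runs to the
    # current end of the partition.
    return consumer.end_offsets([tp])[tp]
```

When the fallback is used, no record exists at the returned offset yet, so a loop that only tests `msg.offset >= end_offset` would block waiting for new data. Checking `msg.offset + 1 >= end_offset` after counting each message, and skipping the loop entirely when the start offset is already at or past the end offset, avoids that.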
You can find the earliest offset for the beginning of the specified time interval and rewind to that offset. However, it is difficult to tell where the end of the interval is, because records with earlier timestamps may arrive later. So you can consume records from the start of the interval until you find records with timestamps later than the endTime, plus some more records to catch late messages.
The code for rewinding to the startTime is:
public void rewind(DateTime time) {
    Set<TopicPartition> assignments = consumer.assignment();
    // Ask for the earliest offset at or after `time` on every assigned partition.
    Map<TopicPartition, Long> query = new HashMap<>();
    for (TopicPartition topicPartition : assignments) {
        query.put(topicPartition, time.getMillis());
    }
    Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);
    // A null value means the partition has no record at or after `time`; fall back to offset 0.
    result.forEach((topicPartition, offsetAndTimestamp) -> consumer.seek(topicPartition,
            Optional.ofNullable(offsetAndTimestamp).map(OffsetAndTimestamp::offset).orElse(0L)));
}
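The "read a little past the end" idea described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the answer's own code: `records_in_interval` and the `grace` parameter are hypothetical, `records` is any iterable of objects with a millisecond `timestamp` attribute (for example the records of one partition after seeking to the start offset), and the choice to stop after a fixed number of records past the end is just one possible policy.

```python
def records_in_interval(records, start_ms, end_ms, grace=100):
    """Yield records with start_ms <= timestamp < end_ms.

    Keeps reading past the first record at or after end_ms so that late
    arrivals are still caught, and stops once `grace` such records were seen.
    """
    past_end = 0
    for record in records:
        if record.timestamp >= end_ms:
            past_end += 1
            if past_end >= grace:
                break
            continue
        if record.timestamp >= start_ms:
            yield record
```

A larger `grace` catches more out-of-order records at the cost of reading further beyond the interval; how large it needs to be depends on how late producers can be.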

Here is a minimal example in Python with multiple partitions:
from kafka import KafkaConsumer, TopicPartition
from datetime import datetime, timedelta

consumer = KafkaConsumer("test", bootstrap_servers="localhost:9092", group_id="group1", max_poll_records=5)
consumer.poll()  # dummy poll so that partitions get assigned
assignment = consumer.assignment()

date_in = datetime.now() - timedelta(minutes=50)
date_out = datetime.now() - timedelta(minutes=20)
# Kafka timestamps are integer milliseconds
start_ms = int(date_in.timestamp() * 1000)
end_ms = int(date_out.timestamp() * 1000)

# seek every assigned partition to its first message at or after date_in
for partition in assignment:
    rec_in = consumer.offsets_for_times({partition: start_ms})
    if rec_in[partition] is not None:
        consumer.seek(partition, rec_in[partition].offset)

for msg in consumer:
    if msg.timestamp > end_ms:
        # this partition is past date_out: stop fetching from it
        print("pausing partition=" + str(msg.partition))
        consumer.pause(TopicPartition(msg.topic, msg.partition))
        if len(consumer.paused()) == len(consumer.assignment()):
            break
        continue
    print(msg)

-
The above code definitely works in Python; I am not sure about other SDKs. – best wishes Apr 27 '23 at 01:54