5

I want to know whether there is some way, other than by offset, to fetch data for a given time interval. Say I want to consume all the data from yesterday; how do I do it?

OneCricketeer
Volatil3
  • Two-step process: Find the offset ranges that correspond to your date range and then consume those (by offset). https://stackoverflow.com/questions/39514167/retrieve-timestamp-based-data-from-kafka – Thilo May 18 '18 at 07:01
  • @Thilo thanks for the comment. I did see that old thread and was wondering whether anything had changed. Does it mean that I store the offset details somewhere outside Kafka and query Kafka based on that? – Volatil3 May 18 '18 at 07:04
  • Not necessarily. Recent Kafka versions do include a timestamp on all messages. So no need for off-Kafka storage. – Thilo May 18 '18 at 07:05
  • @Thilo can you please help me find an example of accessing messages by timestamp? I am implementing this in Python, but a Java example will work too. I have not been able to find one myself yet. – Volatil3 May 18 '18 at 07:09
  • I'm not familiar with the Python API, but if you can access record metadata (offset, partition number) when you consume messages, then there should also be the timestamp. – Thilo May 18 '18 at 07:14

3 Answers

13

Use offsetsForTimes (offsets_for_times in kafka-python) to find the offset that corresponds to a given timestamp. In Python it looks like this:

from datetime import datetime
from kafka import KafkaConsumer, TopicPartition

topic  = "www.kilskil.com" 
broker = "localhost:9092"

# count the messages from the first day of 2019
# (naive datetimes are interpreted as local time by .timestamp())
date_in  = datetime(2019,1,1)
date_out = datetime(2019,1,2)

tp = TopicPartition(topic, 0)  # partition 0
# a topic created with default settings often has a single partition, numbered 0

# No topic subscription here: assigning the partition manually lets seek() work
# right away, with no dummy poll() needed to trigger a group assignment.
# consumer_timeout_ms ends the loop below if no further messages arrive.
consumer = KafkaConsumer(bootstrap_servers=broker, enable_auto_commit=False,
                         consumer_timeout_ms=5000)
consumer.assign([tp])

# the two methods that matter here are offsets_for_times() and seek()
rec_in  = consumer.offsets_for_times({tp: int(date_in.timestamp() * 1000)})
rec_out = consumer.offsets_for_times({tp: int(date_out.timestamp() * 1000)})

# a partition maps to None when it has no message at or after the timestamp
if rec_in[tp] is None:
  raise SystemExit("no messages at or after {}".format(date_in))
end_offset = rec_out[tp].offset if rec_out[tp] is not None else consumer.end_offsets([tp])[tp]

consumer.seek(tp, rec_in[tp].offset)  # go to the first message of the New Year

c = 0
for msg in consumer:
  if msg.offset >= end_offset:
    break

  c += 1
  # each message also has a .timestamp field

print("{c} messages between {_in} and {_out}".format(c=c, _in=str(date_in), _out=str(date_out)))

Keep in mind that Kafka timestamps are in milliseconds and stored as a long. Python's datetime.timestamp() returns seconds as a float, so multiply it by 1000 (and convert to int). The offsets_for_times method returns a dict with TopicPartition keys and OffsetAndTimestamp values; a value is None when the partition has no message at or after the requested timestamp.
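The conversion described above, applied to the "yesterday" case from the question, could be sketched like this. The helper names to_epoch_millis and yesterday_range_millis are hypothetical, and the sketch assumes that naive datetimes and the notion of "yesterday" are both in UTC; adjust the time zone if your data is keyed to local time.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(dt):
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Assumption: a naive datetime is meant as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids float rounding
    return (dt - EPOCH) // timedelta(milliseconds=1)

def yesterday_range_millis(now):
    """Return (start_ms, end_ms) for the UTC calendar day before `now`.

    `now` is expected to be a timezone-aware datetime.
    """
    today = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return to_epoch_millis(today - timedelta(days=1)), to_epoch_millis(today)
```

The two values can then be passed to offsets_for_times in place of date_in and date_out, for example with yesterday_range_millis(datetime.now(timezone.utc)).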

Jo Ja
3

You can find the earliest offset for the beginning of the time interval and rewind to it. However, it is hard to tell where the interval ends, because records with earlier timestamps may arrive later than records with later ones. So consume from the start of the interval until you see records with timestamps later than endTime, then read some more records to catch any late messages.

The code for rewinding to startTime is:

public void rewind(DateTime time) {
    Set<TopicPartition> assignments = consumer.assignment();
    Map<TopicPartition, Long> query = new HashMap<>();
    for (TopicPartition topicPartition : assignments) {
        query.put(topicPartition, time.getMillis());
    }
    Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);

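    // A null value means the partition has no record at or after the requested time;
    // in that case this falls back to offset 0
    result.forEach((topicPartition, offsetAndTimestamp) -> consumer.seek(topicPartition,
            Optional.ofNullable(offsetAndTimestamp).map(OffsetAndTimestamp::offset).orElse(0L)));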
}
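
The stopping rule described above (keep reading past endTime for a few more records so that late arrivals are not missed) could be sketched in Python roughly as follows. This is an illustration for a single partition only: Record, take_interval, and the grace parameter are hypothetical names, and the choice to stop after a run of consecutive out-of-range records is one possible reading of "some more records", not something the Kafka client provides.

```python
from collections import namedtuple

# Hypothetical stand-in for a consumed message; real consumer records
# also expose .offset and .timestamp (milliseconds)
Record = namedtuple("Record", ["offset", "timestamp"])

def take_interval(records, start_ms, end_ms, grace=100):
    """Yield records with start_ms <= timestamp < end_ms.

    Reading stops once `grace` consecutive records at or past end_ms
    have been seen, so a late in-range record that shows up within
    the grace window is still returned.
    """
    late_streak = 0
    for rec in records:
        if rec.timestamp >= end_ms:
            late_streak += 1
            if late_streak >= grace:
                break
            continue
        late_streak = 0  # an older record resets the streak
        if rec.timestamp >= start_ms:
            yield rec
```

A larger grace value catches more out-of-order records at the cost of reading further past the end of the interval.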
Katya Gorshkova
0

Here is a minimal example in Python that handles multiple partitions:

from kafka import KafkaConsumer, TopicPartition
from datetime import datetime, timedelta

consumer = KafkaConsumer("test", bootstrap_servers="localhost:9092", group_id="group1", max_poll_records=5)
consumer.poll(timeout_ms=5000)  # joining the group takes a moment; assignment() is empty until then
assignment = consumer.assignment()
date_in  = datetime.now() - timedelta(minutes=50)
date_out  = datetime.now() - timedelta(minutes=20)

for partition in assignment:
    rec_in = consumer.offsets_for_times({partition: int(date_in.timestamp() * 1000)})
    if rec_in[partition] is not None:
        consumer.seek(partition, rec_in[partition].offset)

end_ms = int(date_out.timestamp() * 1000)  # msg.timestamp is in milliseconds
for msg in consumer:
    if msg.timestamp > end_ms:
        print("pausing partition=" + str(msg.partition))
        # pause the partition this message came from
        consumer.pause(TopicPartition(msg.topic, msg.partition))
        if len(consumer.paused()) == len(consumer.assignment()):
            break
        continue  # skip messages past the end of the interval
    print(msg)
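
As a variation on the per-partition lookup above, all partitions can be queried in one offsets_for_times call, with an explicit fallback for partitions that come back as None. The sketch below keeps the Kafka calls out of the code so it stays self-contained; build_time_query and resolve_start_offsets are hypothetical helper names, and falling back to the end offset (so that nothing outside the interval is replayed) is an assumption about the desired behavior.

```python
def build_time_query(partitions, timestamp_ms):
    """Map every partition to the same timestamp (milliseconds)."""
    return {tp: timestamp_ms for tp in partitions}

def resolve_start_offsets(found, end_offsets):
    """Pick a start offset per partition.

    `found` has the shape returned by offsets_for_times: each value is
    an object with an .offset attribute, or None when the partition has
    no message at or after the timestamp. In the None case the
    partition's end offset is used, so nothing old is consumed from it.
    """
    starts = {}
    for tp, hit in found.items():
        starts[tp] = hit.offset if hit is not None else end_offsets[tp]
    return starts
```

With a kafka-python consumer this would be wired up as found = consumer.offsets_for_times(build_time_query(consumer.assignment(), start_ms)), followed by resolve_start_offsets(found, consumer.end_offsets(list(found))) and one consumer.seek(tp, offset) per entry.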
best wishes