
I am currently able to consume the latest real-time data in Kafka, but is there a way to consume the last 5 minutes of data in each partition in an optimized way?

My current approach is to set auto.offset.reset to earliest and then consume until I reach the end offset of each partition, picking out the messages that fall within the 5-minute window. But this takes a long time.

If there is a way to do this in reverse order to reduce the consumption time, it would be really helpful!

OneCricketeer
potterson11

1 Answer


The confluent_kafka.Consumer.offsets_for_times() function provides a mechanism to obtain the earliest offsets for TopicPartition objects where the timestamps are greater than or equal to a POSIX timestamp provided in milliseconds.

You could register a callback function for the on_assign event when subscribing to your topic(s). The callback can use Consumer.offsets_for_times() and Consumer.assign() to reset the offsets of your assigned partitions to the desired positions before any messages are consumed.

For example, you might do something like this:

import datetime
import math
from confluent_kafka import Consumer, TopicPartition

def get_time_offset():
  '''Returns the POSIX epoch representation (in milliseconds) of the datetime 5 minutes prior to being called'''
  delta = datetime.timedelta(minutes=5)  
  now = datetime.datetime.now(datetime.timezone.utc)  # TZ-aware object to simplify POSIX epoch conversion
  prior = now - delta  
  return math.floor(prior.timestamp() * 1000)  # convert seconds to milliseconds for Consumer.offsets_for_times()

def reset_offsets(consumer, partitions):
  '''Resets the offsets of the provided partitions to the first offsets found corresponding to timestamps greater than or equal to 5 minutes ago.'''
  time_offset = get_time_offset()
  search_partitions = [TopicPartition(p.topic, p.partition, time_offset) for p in partitions]  # new TPs with offset= time_offset
  time_offset_partitions = consumer.offsets_for_times(search_partitions)  # find TPs with timestamp of earliest offset >= time_offset
  consumer.assign(time_offset_partitions)  # (re-)set consumer partition assignments and start consuming

topics = ['my-topic-of-interest']

c = Consumer({
  'bootstrap.servers': 'server-fqdn',
  'group.id': 'group-name'
})

c.subscribe(topics, on_assign=reset_offsets)  # reset_offsets() called when partition assignment received after c.poll()

# Process all messages from reset offsets (5 min. ago) to present (and ongoing)
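while True:
  try:
    msg = c.poll(1.0)  # first call triggers execution of on_assign callback function, resetting offsets
  except RuntimeError:
    print("Consumer is closed.")
    break
  if msg is None:  # no message arrived within the 1 second timeout
    continue
  if msg.error():  # error event rather than a real message
    print(f"Consumer error: {msg.error()}")
    continue
  # process message and commit...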

c.close()
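The loop above keeps running after it catches up to the present. If you only want the 5-minute window and then want to stop, one option is to snapshot each partition's high watermark with Consumer.get_watermark_offsets() and stop once every partition has reached it. Below is a minimal sketch of that idea, not a tested production implementation. The names partitions_to_read, mark_consumed, consume_window and the handle callback are hypothetical. It assumes the consumer is assigned manually rather than subscribed, and that offsets are contiguous (no compaction or transaction markers).

```python
def partitions_to_read(start_offsets, end_offsets):
    """Return the partitions whose time window is non-empty.

    Both arguments map (topic, partition) -> offset. A negative start offset
    means offsets_for_times() found no message at or after the timestamp.
    """
    return {tp for tp, start in start_offsets.items()
            if 0 <= start < end_offsets[tp]}


def mark_consumed(remaining, end_offsets, tp, offset):
    """Record one consumed offset; return True once every partition is done."""
    if offset + 1 >= end_offsets[tp]:  # high watermark = next offset to be written
        remaining.discard(tp)
    return not remaining


def consume_window(consumer, start_partitions, handle):
    """Read each partition from its start offset up to the end snapshot.

    consumer:         a confluent_kafka.Consumer that has not been subscribed
    start_partitions: the list returned by Consumer.offsets_for_times()
    handle:           hypothetical callback invoked once per message
    """
    starts = {(p.topic, p.partition): p.offset for p in start_partitions}
    ends = {}
    for p in start_partitions:
        watermarks = consumer.get_watermark_offsets(p, timeout=10)
        if watermarks is None:  # broker did not answer within the timeout
            raise RuntimeError(f"no watermarks for {p.topic}[{p.partition}]")
        ends[(p.topic, p.partition)] = watermarks[1]  # (low, high) -> high

    remaining = partitions_to_read(starts, ends)
    if not remaining:
        return  # nothing was produced in the window

    consumer.assign([p for p in start_partitions
                     if (p.topic, p.partition) in remaining])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        tp = (msg.topic(), msg.partition())
        if tp not in remaining:
            continue  # this partition already reached its snapshot
        if msg.offset() < ends[tp]:
            handle(msg)  # message lies inside the window
        if mark_consumed(remaining, ends, tp, msg.offset()):
            break
```

You would call it with the result of offsets_for_times(), for example consume_window(c, c.offsets_for_times(search_partitions), print), building search_partitions as in reset_offsets() above. Messages produced after the snapshot are ignored, so the call returns even while producers are still writing.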
Philip Wrage