This field is what gives us checkpoint-like behaviour. Managing offsets is key to data continuity over the lifecycle of the stream processing application: if the streaming application is shut down or fails unexpectedly, the offset ranges are lost unless they are persisted in a non-volatile data store, and without the offsets of the partitions being read, the Spark Streaming job cannot resume from where it last left off. Offsets can therefore be handled in several ways.
One way is to store the offset value in Zookeeper and read it back while creating the DStream.
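The snippet further down relies on a few names that are defined elsewhere in the job (the StreamingContext, the topic name and the Kafka connection parameters). As a rough sketch, under purely illustrative assumptions about the broker address, topic name and batch interval, their setup could look like this:

    # Hypothetical setup for the names used below; all values are assumptions.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="kafka-offset-demo")        # assumed application name
    ssc = StreamingContext(sc, 10)                         # assumed 10-second batch interval
    var_topic_src_name = "my_source_topic"                 # assumed Kafka topic
    var_kafka_parms_src = {"metadata.broker.list": "127.0.0.1:9092"}  # assumed broker list

With those in place, the Zookeeper-based offset handling looks like this: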
    from kazoo.client import KazooClient
    from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

    ZOOKEEPER_SERVERS = "127.0.0.1:2181"

    def get_zookeeper_instance():
        # Reuse a single Kazoo client per Python process.
        from kazoo.client import KazooClient
        if 'KazooSingletonInstance' not in globals():
            globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
            globals()['KazooSingletonInstance'].start()
        return globals()['KazooSingletonInstance']

    def save_offsets(rdd):
        # Persist the upper offset of each processed range under /consumers/<topic>.
        zk = get_zookeeper_instance()
        for offset in rdd.offsetRanges():
            path = f"/consumers/{var_topic_src_name}"
            print(path)
            zk.ensure_path(path)
            zk.set(path, str(offset.untilOffset).encode())

    # On startup, read the last saved offset; fall back to 0 on the very first run.
    zk = get_zookeeper_instance()
    var_offset_path = f"/consumers/{var_topic_src_name}"
    try:
        var_offset = int(zk.get(var_offset_path)[0])
    except Exception:
        print("The Spark Streaming job is starting for the first time, so the offset defaults to zero")
        var_offset = 0
    var_partition = 0
    topicpartion = TopicAndPartition(var_topic_src_name, var_partition)
    fromoffset = {topicpartion: var_offset}
    print(fromoffset)

    # ssc, var_kafka_parms_src and serializer.decode_message are assumed to be defined elsewhere.
    kvs = KafkaUtils.createDirectStream(ssc,
                                        [var_topic_src_name],
                                        var_kafka_parms_src,
                                        valueDecoder=serializer.decode_message,
                                        fromOffsets=fromoffset)
    kvs.foreachRDD(handler)
    kvs.foreachRDD(save_offsets)
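The handler function passed to foreachRDD is not shown above; it is whatever per-batch processing your job performs. A minimal hypothetical placeholder, just so the example runs end to end, might look like this:

    def handler(rdd):
        # Hypothetical per-batch logic: print a few records from each micro-batch.
        # Replace this with your real processing.
        for record in rdd.take(10):
            print(record)

Output operations registered with foreachRDD run in the order they are registered, so for each batch save_offsets runs after handler; if the job dies between processing and saving, that batch is reprocessed on restart, giving at-least-once semantics.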
Reference: pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset