
I have Spark Streaming jobs that consume from and produce to Kafka. Because I use a direct DStream, I have to manage the offsets myself, and we adopted Redis to write and read them. Now there is one problem: when I launch my client, it needs to get the offsets from Redis, not the offsets that exist in Kafka itself. How should I write my code? What I have written so far:

    kafka_stream = KafkaUtils.createDirectStream(
        ssc,
        topics=[config.CONSUME_TOPIC],
        kafkaParams={"bootstrap.servers": config.CONSUME_BROKERS,
                     "auto.offset.reset": "largest"},
        fromOffsets=read_offset_range(config.OFFSET_KEY))

But I think fromOffsets only supplies the value (from Redis) when the Spark Streaming client is launched, not while it is running. Thank you for helping.
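One way to build the `fromOffsets` argument from offsets persisted in Redis can be sketched like this. The `read_offset_range` name comes from the question; the hash key layout (`"offsets:<topic>"` mapping partition number to last offset) is an assumption, and a plain in-memory dict stands in for a `redis.Redis` client so the sketch runs without Spark or Redis installed:

```python
# Sketch of a read_offset_range helper, assuming offsets are stored in Redis
# as a hash mapping partition number -> last processed offset. The key layout
# ("offsets:<topic>") is an assumption; the dict below stands in for Redis.

def read_offset_range(store, topic):
    """Build the fromOffsets mapping expected by createDirectStream.

    In real code the keys would be
    pyspark.streaming.kafka.TopicAndPartition(topic, partition) objects;
    plain (topic, partition) tuples are used here so the sketch runs
    without Spark installed.
    """
    saved = store.get("offsets:%s" % topic, {})
    return {(topic, int(partition)): int(offset)
            for partition, offset in saved.items()}

# Stand-in for Redis: hash "offsets:my_topic" with per-partition offsets.
fake_redis = {"offsets:my_topic": {"0": "1500", "1": "1723"}}

from_offsets = read_offset_range(fake_redis, "my_topic")
print(from_offsets)  # {('my_topic', 0): 1500, ('my_topic', 1): 1723}
```

This mapping is evaluated once, when the stream is created, which is exactly the behavior the question describes.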

Frank
  • For anyone who is looking for "how to maintain offset value in ZooKeeper", the link below explains it with Python code: https://stackoverflow.com/questions/44110027/pyspark-kafka-direct-streaming-update-zookeeper-kafka-offset?rq=1 – SaddamBinSyed Jul 07 '21 at 08:13

1 Answer


If I understand you correctly, you need to set your offset manually. This is how I do it:

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming.kafka import TopicAndPartition

stream = StreamingContext(sc, 120)  # 120-second batch interval

kafkaParams = {"metadata.broker.list": "1:6667,2:6667,3:6667"}
kafkaParams["auto.offset.reset"] = "smallest"
kafkaParams["enable.auto.commit"] = "false"

topic = "xyz"
topicPartition = TopicAndPartition(topic, 0)
# On Python 3, use int() instead of long()
fromOffset = {topicPartition: long(PUT NUMERIC OFFSET HERE)}

kafka_stream = KafkaUtils.createDirectStream(stream, [topic], kafkaParams, fromOffsets=fromOffset)
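This only seeds the starting position. To keep Redis current while the job runs, the usual direct-stream pattern is to capture each batch's offset ranges (in Spark, via `rdd.offsetRanges()` inside a `transform`/`foreachRDD`) and write the end offsets back after the batch is processed. A sketch of that pattern, with a namedtuple and an in-memory dict standing in for Spark's `OffsetRange` and for Redis:

```python
# Sketch of the per-batch offset commit pattern for a direct stream. With
# Spark you would read rdd.offsetRanges() inside transform/foreachRDD; the
# OffsetRange namedtuple and the dict below are stand-ins so the sketch
# runs without Spark or Redis.
from collections import namedtuple

OffsetRange = namedtuple("OffsetRange",
                         ["topic", "partition", "fromOffset", "untilOffset"])

def save_offsets(store, offset_ranges):
    """Persist the end offset of each partition after a batch completes."""
    for o in offset_ranges:
        key = "offsets:%s" % o.topic
        store.setdefault(key, {})[str(o.partition)] = str(o.untilOffset)

# One simulated micro-batch: partitions 0 and 1 both advanced.
fake_redis = {}
batch_ranges = [OffsetRange("xyz", 0, 1500, 1620),
                OffsetRange("xyz", 1, 1723, 1780)]
save_offsets(fake_redis, batch_ranges)
print(fake_redis)  # {'offsets:xyz': {'0': '1620', '1': '1780'}}
```

On the next launch, whatever was last written here is what gets read back into `fromOffsets`, so a restart resumes where the previous run stopped.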
user3689574
  • Yeah, I understand your answer. However, I thought the "fromOffsets" param only sets the offsets when the client restarts. So how does the client get the offsets while it is consuming: from Kafka itself, or still from fromOffsets? – Frank Apr 14 '18 at 10:33
  • I'm not sure what you mean. When you start a stream it has to start from some offset, and from there it just keeps going until it reaches the end of the topic. The param "fromOffsets" means: when you use "createDirectStream", you don't want to read the offset from ZooKeeper, as you would with "createStream", so you need to provide it yourself. Are you asking how to get the offset without using ZooKeeper? – user3689574 Apr 14 '18 at 12:37
  • BTW, I want to ask another question: how do you debug your Python Spark Streaming jobs? I used "logger" to output useful messages, but it doesn't work on the Spark cluster (deploy mode is client). – Frank Apr 15 '18 at 10:31
  • Debugging is a problem. Usually I use Jupyter to pre-run my code and test it in a friendlier environment. If I have to debug in production I use the (quite "ugly") practice of writing logs to SQL from inside the RDD. – user3689574 Apr 16 '18 at 09:32
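On the logging point in the comments: the driver's logging configuration is not inherited by executor processes, which is one common reason "logger" output never shows up on the cluster. A workaround is to configure a logger inside the function that Spark ships to the workers (e.g. via `rdd.mapPartitions`), so each executor writes to its own stderr log. A sketch of that pattern, invoked here on a plain iterator in place of Spark:

```python
# Sketch of per-executor logging: configure logging inside the function
# that runs on the workers, since the driver's logging setup does not
# reach executor processes. Calling it on a plain iterator here stands
# in for Spark invoking it via rdd.mapPartitions.
import logging

def process_partition(records):
    # Runs on the executor: set up the logger locally, on first use.
    logger = logging.getLogger("my_streaming_job")
    if not logger.handlers:
        handler = logging.StreamHandler()  # goes to the executor's stderr log
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    out = []
    for record in records:
        logger.info("processing %r", record)
        out.append(record.upper())
    return out

result = process_partition(iter(["a", "b"]))
print(result)  # ['A', 'B']
```

The executor logs then appear in the per-container stderr files (viewable through the Spark UI or YARN logs), rather than on the driver's console.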