
I have a scenario where I want to re-process a particular batch of data coming in from Kafka using Spark DStreams.

Let's say I want to re-process the following batches of data:

Topic-Partition1-{1000, 2000}, Topic-Partition2-{500, 600}

Below is the code snippet I have, where I can specify the starting offsets.

val inputDStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](
    topicPartitionList,
    kafkaProps,
    startingOffsetRanges))
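
(For reference, the two placeholders above would look roughly like this; the topic name and starting offsets just follow my example and the exact values are illustrative:)

import org.apache.kafka.common.TopicPartition

// Partitions to assign, matching the example above.
val topicPartitionList = List(
  new TopicPartition("Topic", 1),
  new TopicPartition("Topic", 2)
)

// Inclusive starting offset per partition, as ConsumerStrategies.Assign expects.
val startingOffsetRanges = Map(
  new TopicPartition("Topic", 1) -> 1000L,
  new TopicPartition("Topic", 2) -> 500L
)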

But I want to know whether there is any way to also specify the ending offsets, as in Structured Streaming's batch mode.

So essentially, it should process this small batch and stop the workflow.

Note: I do not want to use Structured Streaming for this use case; I want to use DStreams only.


1 Answer


Found a way to do it.

import org.apache.spark.streaming.kafka010.{KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val offsetRanges = Array(
  // topic, partition, inclusive starting offset, exclusive ending offset
  OffsetRange("test", 0, 0, 100),
  OffsetRange("test", 1, 0, 100)
)

val rdd = KafkaUtils.createRDD[String, String](sparkContext, kafkaParams, offsetRanges, PreferConsistent)
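
Since createRDD is a plain batch API, the job naturally finishes once the RDD has been processed, so there is no streaming workflow to stop. A minimal usage sketch (the map/count below is just a placeholder for whatever re-processing you need):

// Consume only the bounded offset ranges above, then exit; no StreamingContext involved.
val values = rdd.map(record => record.value())
println(s"Re-processed ${values.count()} records")
sparkContext.stop()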