
If I am correct, by default Spark Streaming 1.6.1 uses a single thread to read data from each Kafka partition. Let's assume my Kafka topic has 50 partitions; does that mean messages in the 50 partitions will be read sequentially, or perhaps in round-robin fashion?

Case 1:

- If yes, then how do I parallelize the read operation at the partition level? Is creating multiple KafkaUtils.createDirectStream calls the only solution?

e.g.
      import kafka.serializer.StringDecoder
      import org.apache.spark.streaming.kafka.KafkaUtils

      // Two direct streams over the same topic set, keeping only the message values.
      val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet).map(_._2)

      val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet).map(_._2)
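
(If multiple streams are indeed the way to go, I assume they would then be unioned back into one DStream before processing; a minimal sketch using the standard StreamingContext.union:)

      // Assumption: combine both direct streams so downstream
      // transformations see messages from all partitions.
      val unioned = ssc.union(Seq(stream1, stream2))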

Case 2:

- If my Kafka partition is receiving 5 messages/sec, how do the "--conf spark.streaming.kafka.maxRatePerPartition=3" and "--conf spark.streaming.blockInterval" properties come into the picture in such a scenario?

nilesh1212

2 Answers


In the direct model:

  • each partition is accessed sequentially
  • different partitions are accessed in parallel (see the sketch after this list)
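
A minimal sketch of what that implies, reusing ssc, kafkaParams and topicsSet from the question: a single direct stream already yields one Spark partition per Kafka partition, so all 50 partitions are consumed in parallel without creating multiple streams.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // One direct stream is enough: each batch RDD has one Spark partition
    // per Kafka topic partition, and those partitions run in parallel.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    stream.foreachRDD { rdd =>
      // Expected to print 50 for a 50-partition topic.
      println(s"spark partitions = ${rdd.partitions.length}")
    }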

In the second case it depends on the batch interval, but in general if maxRatePerPartition is lower than the actual per-partition arrival rate, you'll always be lagging further and further behind.
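
To make that concrete with the question's numbers, a back-of-the-envelope sketch (the 10-second batch interval is an assumption, not something stated in the question):

    // Backlog growth per partition under the question's rates.
    val arrivalRate         = 5   // messages/sec actually produced (question)
    val maxRatePerPartition = 3   // messages/sec Spark may read (question)
    val batchIntervalSec    = 10  // assumed batch interval

    val arrivedPerBatch  = arrivalRate * batchIntervalSec          // 50
    val consumedPerBatch = maxRatePerPartition * batchIntervalSec  // 30
    val backlogGrowth    = arrivedPerBatch - consumedPerBatch      // grows by 20 per batch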

  • So if my Kafka is storing lots of messages at extreme scale, is having multiple streams the only solution to consume messages concurrently, or will setting an appropriate value for "maxRatePerPartition" do? Which one is recommended? One more thing I have noticed: the direct API does not show backlog batches on the streaming UI. I guess this is because it calculates the offset range at each batch interval, so data consumption happens only when the last batch finishes, unlike the receiver approach, which keeps storing incoming messages even when the current batch exceeds its batch interval. – nilesh1212 Dec 14 '16 at 06:21
  • One more question: spark.streaming.blockInterval applies to the 1-1 reading between a Spark partition and a Kafka partition, right? Since Spark talks 1-1 with Kafka partitions. – nilesh1212 Dec 14 '16 at 14:55

In case two:

spark.streaming.blockInterval

This setting only impacts the receiver-based approach; see the docs:

Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
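
For reference, a sketch of where that setting would be applied in a receiver-based job (the 200ms value is only an illustration; it is also the default). Each receiver then produces roughly batchInterval / blockInterval blocks, i.e. Spark partitions, per batch:

    import org.apache.spark.SparkConf

    // Only meaningful for receiver-based streams; the direct approach
    // ignores it because partitioning follows the Kafka partitions.
    val conf = new SparkConf()
      .setAppName("ReceiverBlockIntervalExample") // hypothetical app name
      .set("spark.streaming.blockInterval", "200ms")
    // With a 2s batch interval: 2000ms / 200ms = 10 blocks per receiver per batch.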


spark.streaming.kafka.maxRatePerPartition = 3 is less than the 5 messages/sec you say each partition receives.

The total delay will keep increasing; see:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval
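
A sketch of how the limit is typically set programmatically (equivalent to the --conf flag in the question); note that Spark also offers spark.streaming.backpressure.enabled=true to adapt the rate automatically instead of hard-coding it:

    import org.apache.spark.SparkConf

    // Caps each batch at maxRatePerPartition * batchInterval messages
    // per Kafka partition; with 3 < 5 msg/s arriving, the Kafka backlog grows.
    val conf = new SparkConf()
      .set("spark.streaming.kafka.maxRatePerPartition", "3")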

Zhang Tong