
I am new to Spark and Kafka, and I have a slightly different usage pattern of Spark Streaming with Kafka. I am using:

spark-core_2.10 - 2.1.1
spark-streaming_2.10 - 2.1.1
spark-streaming-kafka-0-10_2.10 - 2.0.0
kafka_2.10 - 0.10.1.1

Continuous event data is being streamed to a Kafka topic, which I need to process from multiple Spark Streaming applications. But when I run the Spark Streaming apps, only one of them receives the data.

     import java.util.Arrays;
     import java.util.Collection;
     import java.util.HashMap;
     import java.util.Map;

     import org.apache.kafka.clients.consumer.ConsumerRecord;
     import org.apache.spark.streaming.api.java.JavaInputDStream;
     import org.apache.spark.streaming.kafka010.ConsumerStrategies;
     import org.apache.spark.streaming.kafka010.KafkaUtils;
     import org.apache.spark.streaming.kafka010.LocationStrategies;

     // Kafka consumer configuration
     Map<String, Object> kafkaParams = new HashMap<String, Object>();
     kafkaParams.put("bootstrap.servers", "localhost:9092");
     kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
     kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
     kafkaParams.put("auto.offset.reset", "latest");
     kafkaParams.put("group.id", "test-consumer-group");
     kafkaParams.put("enable.auto.commit", "true");
     kafkaParams.put("auto.commit.interval.ms", "1000");
     kafkaParams.put("session.timeout.ms", "30000");

     Collection<String> topics = Arrays.asList("4908100105999_000005");

     // Direct stream: no receivers; Spark reads the Kafka partitions in parallel
     JavaInputDStream<ConsumerRecord<String, String>> stream =
             KafkaUtils.createDirectStream(
                     ssc,
                     LocationStrategies.PreferConsistent(),
                     ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

      ... //spark processing

I have two Spark Streaming applications; usually the first one I submit consumes the Kafka messages, while the second application just waits for messages and never proceeds. As I have read, Kafka topics can be subscribed to by multiple consumers; is that not true for Spark Streaming? Or is there something I am missing with the Kafka topic and its configuration?

Thanks in advance.

Gurubg

2 Answers


You can create different streams with different group IDs. Here are more details from the online documentation; for the 0.8 integration, there are two approaches:

Approach 1: Receiver-based Approach

Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.

Approach 2: Direct Approach (No Receivers)

No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
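As a quick illustration, here is a minimal sketch (reusing the stream variable from the question's code) that would print the partition count of each batch RDD; with the direct approach it should equal the number of Kafka partitions in the topic:

     // Hedged sketch: with the direct approach, each batch RDD should have
     // exactly as many partitions as the Kafka topic being consumed.
     stream.foreachRDD(rdd -> {
         System.out.println("RDD partitions this batch: " + rdd.getNumPartitions());
     });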

You can read more in the Spark Streaming + Kafka Integration Guide (0.8).

From your code it looks like you are using 0.10; refer to the Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).

Even though it uses the Spark Streaming API, consumption is controlled by the Kafka properties, so depending on the group.id you specify, you can start multiple streams with different group IDs.
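For example, here is a minimal sketch of what the second application could use, reusing topics and ssc from the question; the group ID test-consumer-group-2 is just a hypothetical name:

     // Copy the question's params, but give this app its own consumer group,
     // so it independently receives every message on the topic (hypothetical name).
     Map<String, Object> kafkaParams2 = new HashMap<String, Object>(kafkaParams);
     kafkaParams2.put("group.id", "test-consumer-group-2");

     JavaInputDStream<ConsumerRecord<String, String>> stream2 =
             KafkaUtils.createDirectStream(
                     ssc,
                     LocationStrategies.PreferConsistent(),
                     ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams2));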

Cheers!

Sachin Thapa
  • I was using the same group ID in both consumers, so only one consumer was receiving the messages. Consumers with different group.id values subscribing to the same topic receive the messages separately/in parallel. – Gurubg Sep 01 '17 at 10:46
  • Yes, if you use the same group ID then only one will receive the message. – Sachin Thapa Sep 01 '17 at 13:37

The number of consumers in a consumer group cannot exceed the number of partitions in the topic. If you want to consume the messages in parallel, you will need to introduce a suitable number of partitions and create receivers to process each partition.
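As a rough sketch, assuming the plain Kafka consumer API already on your classpath, you could check a topic's partition count before deciding how many consumers to run in one group; the PartitionCheck class and the planned consumer count below are illustrative:

     import java.util.List;
     import java.util.Properties;
     import org.apache.kafka.clients.consumer.KafkaConsumer;
     import org.apache.kafka.common.PartitionInfo;

     public class PartitionCheck {
         public static void main(String[] args) {
             Properties props = new Properties();
             props.put("bootstrap.servers", "localhost:9092");
             props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
             props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

             int plannedConsumersInGroup = 2; // assumed number of consumers in one group

             try (KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props)) {
                 List<PartitionInfo> partitions = consumer.partitionsFor("4908100105999_000005");
                 if (partitions.size() < plannedConsumersInGroup) {
                     // Extra consumers in the same group would sit idle.
                     System.out.println("Topic has only " + partitions.size() + " partition(s)");
                 }
             }
         }
     }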

  • What is the difference between having two consumer groups and having two partitions under the same consumer group? – Gurubg Oct 26 '17 at 07:01
  • I meant Kafka partitions. If you have two partitions in your Kafka topic and want to process the messages in parallel, then you can introduce a group of consumers [the number of consumers in this consumer group should not exceed the number of partitions in the topic being consumed]. Consumer groups are identified by consumer group IDs. If two consumer groups have the same group ID, then Kafka will treat them as one. If you are using the same code for both your applications, then try changing kafkaParams.put("group.id", "test-consumer-group1") for the second application. – Vinoth Chinnasamy Nov 03 '17 at 08:36
  • Does having a single partition and reading from two consumer groups affect the performance or throughput of Kafka? Currently I have 4 topics, all with a single partition, consumed from two different consumer groups. I am not sure this will scale up without any dent in performance when the incoming data rate increases. – Gurubg Mar 15 '18 at 07:37
  • No, adding additional consumers should not affect the performance of Kafka [network bandwidth could be a bottleneck; please ensure it is sufficient to support the increase in data transfer. Kafka as such will not have any performance dent]. – Vinoth Chinnasamy Mar 16 '18 at 12:06