
I am new to Apache Spark and need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm looking at using Kafka as the broker between these processes. So the high-level job-to-job communication would look like:

  1. Job #1 does some work and publishes a message to a Kafka topic (a rough producer sketch follows this list)
  2. Job #2 is set up as a streaming receiver (using a StreamingContext) to that same Kafka topic, and as soon as the message is published to the topic, Job #2 consumes it
  3. Job #2 can now do some work, based on the message it consumed
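
For context, here's roughly how I'm picturing the publishing side (Job #1). This is just a sketch using the plain Kafka producer client, and the broker address, topic name and payload are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical Job #1: after finishing its work, push a message onto
// the shared topic for Job #2 to pick up
val producerProps = new Properties()
producerProps.put("bootstrap.servers", "my-kafka-ip:9092")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](producerProps)
producer.send(new ProducerRecord[String, String]("someTopic", "work-item-payload"))
producer.close()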

From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node. This means that once I start the streaming consumer like so:

def createKafkaStream(ssc: StreamingContext,
        kafkaTopics: String, brokers: String): DStream[(String, String)] = {
    // build the Kafka configs from the method parameters
    val topicsSet = kafkaTopics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet)
}

def consumerHandler(): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(10))

    createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
        rdd.collect().foreach { msg =>
            // Now do some work as soon as we receive a message from the topic
        }
    })

    ssc
}

StreamingContext.getActive.foreach {
    _.stop(stopSparkContext = false)
}

val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()

...that there are now 2 implications:

  1. The Driver is now blocking and listening for work to consume from Kafka; and
  2. When work (messages) is received, it is sent out to any available Worker Nodes to actually be processed

So first, if anything that I've said above is incorrect or misleading, please begin by correcting me! Assuming I'm more or less correct, then I'm simply wondering if there is a more scalable or performant way to accomplish this, given my criteria. Again, I have two long-running jobs (Job #1 and Job #2) that are running on my Spark nodes, and one of them needs to be able to 'send work to' the other one. Any ideas?

  • BTW - Using `rdd.collect` inside foreachRDD will cause the entire dataset to be sent back to the driver. You definitely don't want that. – Yuval Itzchakov Aug 15 '16 at 17:51
  • Thanks @Yuval (+1), what's a better/more efficient way to gain access to the individual messages being consumed? That wasn't my intention, I'm just new to the API, so feel free to update my code! – smeeb Aug 15 '16 at 17:52
  • You can use `rdd.foreach`. – Yuval Itzchakov Aug 15 '16 at 18:55

1 Answer


From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node.

A StreamingContext (singular) isn't a blocking listener. Its job is to create the execution graph for your streaming job.

When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds. What happens from then on depends on which Kafka abstraction you're using: either the receiver approach via KafkaUtils.createStream, or the receiver-less (direct) approach via KafkaUtils.createDirectStream.

In both approaches, data is consumed from Kafka and then dispatched to the Spark workers to be processed in parallel.
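
To make the distinction concrete, here's a rough sketch of the two entry points (assuming the Spark 1.x Kafka 0.8 integration; the broker, ZooKeeper and topic values are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver approach: a long-running receiver consumes from Kafka via
// ZooKeeper and stores the data in Spark before it is processed
val receiverStream = KafkaUtils.createStream(
    ssc,                        // your StreamingContext
    "my-zookeeper-ip:2181",     // ZooKeeper quorum
    "my-consumer-group",        // consumer group id
    Map("someTopic" -> 1))      // topics and receiver threads per topic

// Receiver-less (direct) approach: each batch reads offset ranges
// straight from the Kafka brokers, one RDD partition per Kafka partition
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("metadata.broker.list" -> "my-kafka-ip:9092"),
    Set("someTopic"))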

then I'm simply wondering if there is a more scalable or performant way to accomplish this

This approach is highly scalable. When using the receiver-less approach, each Kafka partition maps to a Spark partition in a given RDD. You can increase parallelism by either increasing the number of partitions in Kafka, or by repartitioning the data inside Spark (using DStream.repartition). I suggest testing this setup to determine if it suits your performance requirements.
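
As a minimal sketch (reusing your createKafkaStream helper from the question; the partition count is an arbitrary example, not a recommendation):

// The direct stream starts with one Spark partition per Kafka partition.
// If the topic has fewer partitions than you have cores available,
// repartition to spread the processing across more tasks.
createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092")
    .repartition(20)    // arbitrary example value - tune for your cluster
    .foreachRDD { rdd =>
        rdd.foreach { msg =>
            // process each record on the executors, rather than
            // collecting everything back to the driver
        }
    }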

  • Thanks @Yuval (+1), several followup questions for you, if you don't mind! **(1)** Can you first confirm that in order to set up "competing consumers" against a Kafka topic on Spark, that I need to have 1-consumer-per-cluster? And is this true with both receiver & receiver-less configs? **(2)** What are the general guidelines for when to use receiver vs. receiver-less approaches? – smeeb Aug 15 '16 at 15:37
  • **(3)** When you say "*When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds...*", where is this configured? Can it be set to something other than 10 seconds? And finally, **(4)** What are Kafka partitions mapped to when using the receiver approach? Thanks again so much! – smeeb Aug 15 '16 at 15:37
  • 1
    @smeeb 1) What do you mean by *"competing consumers"*? 2) I generally suggest using the direct streaming approach, it was introduced in Spark 1.3.0 and has many advantages. I suggest reading up on it. 3) It is configured here: `val ssc = new StreamingContext(sc, Seconds(10))`. 4) In the receiver based approach there's no mapping of partitions. If you want to read in concurrently from Kafka, you have to connect multiple consumers, meaning you have to perform multiple `KafkaUtils.createStream` calls and union them. – Yuval Itzchakov Aug 15 '16 at 15:57
  • Thanks again @Yuval (+1 again)! By "*competing consumers*", I mean multiple consumer threads all listening to and consuming messages from the same Kafka topic. So my first question above was really asking for clarification on my understanding of streaming consumers on Spark, which is: each concurrent consumer **must** be running on its own Spark cluster (true or false?). Or could I just make multiple calls to `KafkaUtils.createDStream` from inside `consumerHandler`, and each call sets up a different 'competing' consumer thread for me (on the same cluster). – smeeb Aug 15 '16 at 16:08
  • I guess I'm confused about the cardinality between cluster and consumer thread. – smeeb Aug 15 '16 at 16:08
  • 1
    If you use the direct stream approach, you'll need neither. Spark will partition Kafka offsets between the workers, and they can concurrently read from Kafka. The cardinality in the direct stream approach is 1:1 between kafka partition and spark RDD partition. The concurrency of reading a topic is based on the number of spark workers that can read offsets from Kafka. – Yuval Itzchakov Aug 15 '16 at 16:39
  • Thanks and +1 again @Yuval - so if I'm understanding what you're saying correctly, `KafkaUtils.createDirectStream` takes care of all this magic for me? Meaning in my `createKafkaStream` function above (please see my edits), if I have 10 available worker nodes, then Spark will automagically set up 10 concurrent consumers on those 10 workers, using the exact code that I've added above (and no other special configs)? – smeeb Aug 15 '16 at 17:11
  • @smeeb If you have 10 partitions in your topic that they can consume from, then yes. – Yuval Itzchakov Aug 15 '16 at 17:38