
I am brand new to Spark & Kafka and am trying to get some Scala code (running as a Spark job) to act as a long-running process (not just a short-lived/scheduled task) and to continuously poll a Kafka broker for messages. When it receives messages, I just want them printed out to the console/STDOUT. Again, this needs to be a long-running process and basically (try to) live forever.

After doing some digging, it seems like a StreamingContext is what I want to use. Here's my best attempt:

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.storage._
import org.apache.spark.streaming.{StreamingContext, Seconds, Minutes, Time}
import org.apache.spark.streaming.dstream._
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder

def createKafkaStream(ssc: StreamingContext, kafkaTopics: String, brokers: String): DStream[(String, String)] = {
    val topicsSet = kafkaTopics.split(",").toSet
    val props = Map(
        "bootstrap.servers" -> "my-kafka.example.com:9092",
        "metadata.broker.list" -> "my-kafka.example.com:9092",
        "serializer.class" -> "kafka.serializer.StringEncoder",
        "value.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
        "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
        "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
        "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
    )
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, props, topicsSet)
}

def processEngine(): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(1))

    val topicStream = createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").print()

    ssc
}

StreamingContext.getActive.foreach {
    _.stop(stopSparkContext = false)
}

val ssc1 = StreamingContext.getActiveOrCreate(processEngine)
ssc1.start()
ssc1.awaitTermination()

When I run this, I get no exceptions/errors, but nothing seems to happen. I can confirm there are messages on the topic. Any ideas as to where I'm going awry?

smeeb

2 Answers


When you print from inside foreachRDD, the output is written on the worker nodes (the executors), not on the driver. I'm assuming you're looking at the driver's console output. You can use DStream.print instead:

val ssc = new StreamingContext(sc, Seconds(1))
val topicStream = createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").print()
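
Alternatively, if you do want to keep a foreachRDD, a rough sketch (assuming each batch is small enough to collect back to the driver) would be:

// Sketch only: collect() ships every record of the batch to the driver,
// so this is only sensible for small, debug-sized batches.
createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").foreachRDD { rdd =>
    rdd.collect().foreach { case (key, value) => println(s"$key -> $value") }
}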

Also, don't forget to call ssc.awaitTermination() after ssc.start():

ssc.start()
ssc.awaitTermination()

As a side note, I'm assuming you copy-pasted this example, but there's no need to use transform on the DStream if you're not actually planning to do anything with the OffsetRange.
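
If you ever do need the offsets, a rough sketch of reading them via transform (assuming the direct stream from your code, whose RDDs implement HasOffsetRanges) would be:

import org.apache.spark.streaming.kafka.HasOffsetRanges

// Sketch only: log each batch's Kafka offsets on the driver, then keep processing as before.
createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").transform { rdd =>
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsets.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
    rdd
}.print()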

Yuval Itzchakov
  • Thanks @Yuval (+1) - are you sure that's all there is to it? When I run this I get output that indicates everything has started up just fine, but it seems to shut down after a few seconds, and none of the messages I have published to the topic seem to get consumed + printed. Thoughts? – smeeb Aug 10 '16 at 14:15
  • @smeeb I forgot to add that after calling `ssc.start()`, you need to call `ssc.awaitTermination()`. Updated the answer. – Yuval Itzchakov Aug 10 '16 at 14:19
  • Ok now I think we're getting somewhere, thanks @Yuval (again!). That solved the stopping issue, but I'm still not seeing topic messages being printed out. Are you sure `val topicStream = createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").print()` will print messages, as they're consumed, to STDOUT? All the examples I've seen use the `map` method that has access to an `event` variable. – smeeb Aug 10 '16 at 14:25
  • I guess in the back of my mind I have the concern that I'm not even connecting to my Kafka broker at all. If I run my "producer" component (not provided in my code above) I can run the Kafka `kafka-console-consumer.sh` (right there on the Kafka server) and see the messages come across. So I *know* my producer is writing messages to the topic. But so far I see no concrete evidence that this "consumer" component above is even talking to Kafka at all! – smeeb Aug 10 '16 at 14:28
  • Arrrrgggg, I was looking at the driver logs. The messages *are* being printed to STDOUT, just not where I was looking. Mehhhhh – smeeb Aug 10 '16 at 14:54

Is this your complete code? Where did you create sc? You have to create the SparkContext before the StreamingContext. You can create it like this:

val conf = new SparkConf().setAppName("SparkConsumer")
val sc = new SparkContext(conf)

Also, without awaitTermination it is very hard to catch and print exceptions that occur during the background data processing. Can you add ssc1.awaitTermination() at the end and see if you get any errors?
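
Outside of a notebook, a rough end-to-end sketch (the app name and batch interval are just placeholders) would be:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of a standalone driver: create the SparkContext yourself, then the StreamingContext.
val conf = new SparkConf().setAppName("SparkConsumer")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))

createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").print()

ssc.start()
ssc.awaitTermination() // blocks "forever" and re-throws errors from the background processing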

Hoda Moradi
  • Thanks @Hoda (+1) - this code is running on Databricks from inside a Scala "notebook", which is like a Scala REPL that provides your `SparkContext` for you via the `sc` variable. And to answer your question, even after adding `awaitTermination` (see my edits) I am not getting any exceptions. – smeeb Aug 10 '16 at 14:40
  • If you start Spark in the shell, then sc is created for you by the spark-shell script – Sergio Alyoshkin Apr 07 '19 at 15:20