
I am using Spark Streaming with RabbitMQ. The streaming job fetches data from RabbitMQ and applies some transformations and actions. I want to know how to apply multiple actions (i.e. calculate two different feature sets) on the same stream. Is it possible? If yes, how do I pass the streaming object to multiple classes, as in the code below?

            val config = ConfigFactory.parseFile(new File("SparkStreaming.conf"))
            val conf = new SparkConf(true).setAppName(config.getString("AppName"))
            conf.set("spark.cleaner.ttl", "120000")         

            val sparkConf = new SparkContext(conf)
            val ssc = new StreamingContext(sparkConf, Seconds(config.getLong("SparkBatchInterval")))

            val rabbitParams = Map(
              "storageLevel" -> "MEMORY_AND_DISK_SER_2",
              "queueName" -> config.getString("RealTimeQueueName"),
              "host" -> config.getString("QueueHost"),
              "exchangeName" -> config.getString("QueueExchangeName"),
              "routingKeys" -> config.getString("QueueRoutingKey"))
            val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
            receiverStream.start()  

How do I process the stream from here?

            val objProcessFeatureSet1 = new ProcessFeatureSet1(Some_Streaming_Object)
            val objProcessFeatureSet2 = new ProcessFeatureSet2(Some_Streaming_Object)

            ssc.start()
            ssc.awaitTermination()  
Naresh

1 Answer


You can run multiple actions on the same DStream, as shown below:

import java.io.File
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import net.minidev.json.JSONValue
import net.minidev.json.JSONObject
// plus the RabbitMQUtils import from the RabbitMQ receiver library you are using

val config = ConfigFactory.parseFile(new File("SparkStreaming.conf"))
val conf = new SparkConf(true).setAppName(config.getString("AppName"))
conf.set("spark.cleaner.ttl", "120000")         

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(config.getLong("SparkBatchInterval")))

val rabbitParams = Map(
  "storageLevel" -> "MEMORY_AND_DISK_SER_2",
  "queueName" -> config.getString("RealTimeQueueName"),
  "host" -> config.getString("QueueHost"),
  "exchangeName" -> config.getString("QueueExchangeName"),
  "routingKeys" -> config.getString("QueueRoutingKey"))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)

val jsonStream = receiverStream.map(msg => {
    // JSONValue.parse returns Object, so cast to JSONObject before calling get
    JSONValue.parse(msg).asInstanceOf[JSONObject]
})

jsonStream.filter(json => {
    "consumer" == json.get("customerType")
}).foreachRDD(rdd => {
    rdd.foreach(json => {
        println("json " + json)
    })
})

jsonStream.filter(json => {
    "non-consumer" == json.get("customerType")
}).foreachRDD(rdd => {
    rdd.foreach(json => {
        println("json " + json)
    })
})
ssc.start()
ssc.awaitTermination()  

In the snippet above, I first create `jsonStream` from the received stream, then derive two different streams from it based on the customer type, and then apply an action (`foreachRDD`) on each of them to print the results.

In a similar way, you can pass the same DStream to two different classes and apply the transformations and actions inside them to calculate the different feature sets.
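For the class-based variant from the question, a minimal sketch could look like the following. The class names `ProcessFeatureSet1` and `ProcessFeatureSet2` come from the question; the filters and method bodies inside them are hypothetical placeholders for whatever feature computation you need:

```scala
import org.apache.spark.streaming.dstream.DStream
import net.minidev.json.JSONObject

// Hypothetical sketch: each class receives the same DStream and
// registers its own transformations and action on it.
class ProcessFeatureSet1(stream: DStream[JSONObject]) {
  def process(): Unit = {
    stream
      .filter(json => "consumer" == json.get("customerType"))
      .foreachRDD(rdd => rdd.foreach(json => println("feature set 1: " + json)))
  }
}

class ProcessFeatureSet2(stream: DStream[JSONObject]) {
  def process(): Unit = {
    stream
      .filter(json => "non-consumer" == json.get("customerType"))
      .foreachRDD(rdd => rdd.foreach(json => println("feature set 2: " + json)))
  }
}

// Usage: both classes register their work on the same stream.
// All of this must happen before ssc.start() is called:
//   new ProcessFeatureSet1(jsonStream).process()
//   new ProcessFeatureSet2(jsonStream).process()
```

Note that registering outputs on a DStream only sets up the computation; nothing runs until `ssc.start()`.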

I hope the explanation above helps you resolve the issue.

Thanks,
Hokam

  • If it solves your problem, can you please accept it. – Hokam Sep 12 '16 at 06:03
  • Note that this solution will reevaluate `JSONValue.parse` both times, then it will do the filter both times. In this scenario a better approach would be to partition on the condition and fork the process after partitioning. – Brett Ryan Feb 12 '18 at 20:32
  • Hi Brett, We can use the persist operator on "jsonStream" received after performing the parse operation using map transformation. – Hokam Feb 14 '18 at 05:09
  • That’s correct. But the solution does not :) however, you can avoid the persist and instead partition on the one condition. – Brett Ryan Feb 14 '18 at 06:16
  • I don't know what partition logic you are talking about, it would be nice if you could just post it, but as you said solution will re-evaluate the JSONValue.parse two times, this re-evaluation can be avoided by persist operator without much effort, just call persist(http://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence) on jsonStream that's it. – Hokam Feb 14 '18 at 06:46
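To make the caching point from this discussion concrete, here is a small sketch assuming the same `receiverStream` as in the answer. `persist` is a standard DStream method; with it, the parsed RDDs are cached and both downstream filters reuse them instead of re-running `JSONValue.parse` per action:

```scala
import org.apache.spark.storage.StorageLevel
import net.minidev.json.{JSONValue, JSONObject}

// Parse each message once, then cache the parsed stream so that both
// filters below read the cached RDDs rather than re-parsing the input.
val jsonStream = receiverStream
  .map(msg => JSONValue.parse(msg).asInstanceOf[JSONObject])
jsonStream.persist(StorageLevel.MEMORY_ONLY_SER)

val consumers    = jsonStream.filter(json => "consumer" == json.get("customerType"))
val nonConsumers = jsonStream.filter(json => "non-consumer" == json.get("customerType"))

consumers.foreachRDD(rdd => rdd.foreach(json => println("consumer: " + json)))
nonConsumers.foreachRDD(rdd => rdd.foreach(json => println("non-consumer: " + json)))
```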