
I'm doing some tests in spark-shell after loading a jar with the Twitter utilities. Here is a code sequence that works:

// launch:
// spark-shell --driver-memory 1g --master local[3] --jars target/scala-2.10/tweetProcessing-1.0.jar

import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._

val consumerKey = ...
val consumerSecret = ...
val accessToken = ...
val accessTokenSecret = ...
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

val ssc = new StreamingContext(sc, Seconds(60))
val tweetStream = TwitterUtils.createStream(ssc, None)
val myNewStream = tweetStream.map(tweet => tweet.getText)
    .map(tweetText => tweetText.toLowerCase.split("\\W+"))
    .transform(rdd =>
        rdd.map { tweetWordSeq =>
            tweetWordSeq.foreach { word =>
                val mySet = Set("apple", "orange")
                if (!mySet(word)) word
            }
        })
myNewStream.foreachRDD((rdd,time) => { 
    println("%s at time %s".format(rdd.count(),time.milliseconds))
})
ssc.start()

(I have cut the computation down to a bare minimum, just to highlight the problem.) Here `mySet` is serialized with the task closure and everything works well.
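
For reference, the same capture pattern on a plain (non-streaming) RDD, as a minimal sketch (not part of the original session): the locally defined `Set` is simply serialized into the task closure and shipped to the executors.

// Minimal sketch: the local Set is captured by the filter closure
// and serialized with the task, just like in the streaming code above.
val mySet = Set("apple", "orange")
val words = sc.parallelize(Seq("apple", "banana", "orange", "kiwi"))
words.filter(word => !mySet(word)).collect()  // Array(banana, kiwi)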

But when I use a broadcast variable instead and adapt the test accordingly:

val ssc = new StreamingContext(sc, Seconds(60))

val mySet = sc.broadcast(Set("apple", "orange"))

val tweetStream = TwitterUtils.createStream(ssc, None)
val myNewStream = tweetStream.map(tweet => tweet.getText)
    .map(tweetText => tweetText.toLowerCase.split("\\W+"))
    .transform(rdd =>
        rdd.map { tweetWordSeq =>
            tweetWordSeq.foreach { word =>
                if (!mySet.value(word)) word
            }
        })
myNewStream.foreachRDD((rdd,time) => { 
    println("%s at time %s".format(rdd.count(),time.milliseconds))
})
ssc.start()

I get:

ERROR JobScheduler: Error generating jobs for time 1464335160000 ms
org.apache.spark.SparkException: Task not serializable
...
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.dstream.TransformedDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.

I would naturally prefer to use broadcast variables (my set is actually a rather large set of stop words), but I don't quite see where the problem comes from.

Michel

1 Answer


You need to create the broadcast variable on the driver (outside of any closures), not within a transformation like `transform`, `foreachRDD`, etc.

val ssc = new StreamingContext(sc, Seconds(60))
val mySet = ssc.sparkContext.broadcast(Set("apple", "orange"))

Then you can access the broadcast variable on the executors, within `transform` or other DStream closures, like:

!(mySet.value)(word)
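
Putting it together, a sketch of the corrected pipeline, adapted from the question's code (the variable names `stopWords` and `filteredWords` are mine, and the no-op `foreach` is replaced by `flatMap`/`filter` so the stream actually carries the kept words):

// Sketch: the broadcast is created once, on the driver; only the small
// Broadcast handle (not the SparkContext) is captured by the closures below.
val ssc = new StreamingContext(sc, Seconds(60))
val stopWords = ssc.sparkContext.broadcast(Set("apple", "orange"))

val tweetStream = TwitterUtils.createStream(ssc, None)
val filteredWords = tweetStream
    .map(tweet => tweet.getText)
    .flatMap(tweetText => tweetText.toLowerCase.split("\\W+"))
    .filter(word => !stopWords.value(word))  // reads the broadcast value on the executors

filteredWords.foreachRDD((rdd, time) =>
    println("%s at time %s".format(rdd.count(), time.milliseconds)))
ssc.start()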

If you have the statement `sc.broadcast(Set("apple", "orange"))` within the `rdd.map` of the `transform` closure, the driver will try to send the StreamingContext over to all the executors, and it is not serializable. That's why you are seeing the NotSerializableException.
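
For contrast, a hypothetical sketch of that anti-pattern (`badSet` is an illustrative name, not from the original post):

// Anti-pattern (illustration only): calling sc.broadcast inside rdd.map means
// the closure captures sc itself, so Spark tries to serialize driver-side
// context into the task and fails with a NotSerializableException.
tweetStream.map(tweet => tweet.getText).transform(rdd =>
    rdd.map { word =>
        val badSet = sc.broadcast(Set("apple", "orange"))  // sc captured in the task closure
        !badSet.value(word)
    })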

Pranav Shukla
  • Sorry, my message was indeed not very clear about this, I obviously wrote `val mySet = sc.broadcast(Set("apple", "orange"))` before the `val myNewStream = tweetStream.map...`. Is it because I'm accessing the Spark context directly (sc) and not via the Streaming context (ssc.sparkContext)? – Michel May 27 '16 at 11:48
  • Oh ok :). Could you please copy-paste the entire code that throws the error? – Pranav Shukla May 27 '16 at 11:52
  • Just upgraded Spark from 1.5.1 to 1.6.1 (and rebuilt the jar), the minimal code appears to work now with the broadcast variable. I'll try the full code. – Michel May 27 '16 at 12:30
  • I don't see anything wrong with the code. I have very similar code working with streaming and broadcast variables. – Pranav Shukla May 27 '16 at 12:51
  • Just tried again, the version with the broadcast variable doesn't work, because of the serialization problem. Exactly the same code (with the broadcast variable) works well with a fixed RDD read from a file (i.e. not within a DStream). – Michel May 30 '16 at 12:25