I'm doing some tests in spark-shell after loading a jar with the Twitter utilities. Here is a code sequence that works:
// launch:
// spark-shell --driver-memory 1g --master local[3] --jars target/scala-2.10/tweetProcessing-1.0.jar
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
val consumerKey = ...
val consumerSecret = ...
val accessToken = ...
val accessTokenSecret = ...
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
val ssc = new StreamingContext(sc, Seconds(60))
val tweetStream = TwitterUtils.createStream(ssc, None)
val myNewStream = tweetStream.map(tweet => tweet.getText)
  .map(tweetText => tweetText.toLowerCase.split("\\W+"))
  .transform(rdd =>
    rdd.map(tweetWordSeq => {
      tweetWordSeq.foreach { word =>
        val mySet = Set("apple", "orange")
        if (!mySet(word)) word
      }
    }))
myNewStream.foreachRDD((rdd, time) => {
  println("%s at time %s".format(rdd.count(), time.milliseconds))
})
ssc.start()
(Actually I reduced the computation to a minimum, just to highlight the problem.) Here mySet is serialized with the task closure and everything goes well.
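To make the closure capture explicit, here is a minimal non-streaming sketch of the same idea, using made-up sample data; the locally defined set is serialized into the task closure and shipped to the executors:
// Minimal sketch: localSet is captured by the filter closure,
// serialized with the task, and sent to each executor.
val localSet = Set("apple", "orange")
val sampleWords = sc.parallelize(Seq("apple", "banana", "orange", "kiwi"))
sampleWords.filter(word => !localSet(word)).collect()
// expected result: Array(banana, kiwi)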
But when I use a broadcast variable instead and adapt the test accordingly:
val ssc = new StreamingContext(sc, Seconds(60))
val mySet = sc.broadcast(Set("apple", "orange"))
val tweetStream = TwitterUtils.createStream(ssc, None)
val myNewStream = tweetStream.map(tweet => tweet.getText)
  .map(tweetText => tweetText.toLowerCase.split("\\W+"))
  .transform(rdd =>
    rdd.map(tweetWordSeq => {
      tweetWordSeq.foreach { word =>
        if (!mySet.value(word)) word
      }
    }))
myNewStream.foreachRDD((rdd, time) => {
  println("%s at time %s".format(rdd.count(), time.milliseconds))
})
ssc.start()
I get:
ERROR JobScheduler: Error generating jobs for time 1464335160000 ms
org.apache.spark.SparkException: Task not serializable
...
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.dstream.TransformedDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
I would naturally prefer to use a broadcast variable (my set is actually a rather large set of stop words), but I don't quite see where the problem comes from.
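For reference, the same broadcast pattern works as expected on a plain RDD outside the streaming context (again a minimal sketch with made-up data), which is why the failure inside transform surprises me:
// Broadcast sketch on a plain RDD: the closure only captures the
// broadcast handle; the set itself is shipped once per executor.
val stopWords = sc.broadcast(Set("apple", "orange"))
val sampleWords = sc.parallelize(Seq("apple", "banana", "orange", "kiwi"))
sampleWords.filter(word => !stopWords.value(word)).collect()
// expected result: Array(banana, kiwi)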