I'm quite confused about how Spark handles data under the hood. For example, when I run a streaming job and apply foreachRDD, the behaviour depends on whether a variable is captured from the outer scope or initialised inside the closure.
val sparkConf = new SparkConf()

dStream.foreachRDD(rdd => {
  // sparkConf is captured from the enclosing scope
  val spark = SparkSession.builder.config(sparkConf).getOrCreate()
  ...
})
In this case, I get an exception:
java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
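For completeness, the stream is a direct Kafka stream (as the exception suggests), and the surrounding setup looks roughly like the sketch below. The app name, broker address, topic name, and batch interval are placeholders rather than my real values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("StreamingJob")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Direct stream from the spark-streaming-kafka (0.8) connector,
// matching the package shown in the exception
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
val dStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic")) // placeholder topic

dStream.foreachRDD(rdd => {
  // same pattern as above: sparkConf captured from the enclosing scope
  val spark = SparkSession.builder.config(sparkConf).getOrCreate()
  // ...
})

ssc.start()
ssc.awaitTermination()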
But if I obtain sparkConf inside the closure instead (from rdd.sparkContext), everything seems to be fine:
dStream.foreachRDD(rdd => {
  // the config is obtained inside the closure, from the RDD itself
  val sparkConf = rdd.sparkContext.getConf
  val spark = SparkSession.builder.config(sparkConf).getOrCreate()
  ...
})
This looks quite odd to me because I thought foreachRDD runs on the driver node, so I didn't expect any difference between the two versions.
Now, if I move both the SparkSession and the config outside foreachRDD, it works fine again:
val sparkConf = new SparkConf()
val spark = SparkSession.builder.config(sparkConf).getOrCreate()

dStream.foreachRDD(rdd => {
  // the session created outside is captured and reused for every batch
  val df = spark.read.json(rdd)
  ...
})
A snippet in the Spark documentation suggests the second version (where both the config and the SparkSession are obtained inside foreachRDD), which seems less efficient to me: why create them for every batch when they could be created just once?
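For reference, the documentation example I have in mind (from the "DataFrame and SQL Operations" section of the Streaming Programming Guide) looks roughly like this; I've paraphrased it, and the word-count DStream is just the guide's running example, not my data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// `words` stands in for whatever DStream[String] the guide builds earlier
def queryEachBatch(words: DStream[String]): Unit = {
  words.foreachRDD { rdd =>
    // the guide obtains the (singleton) SparkSession inside the closure,
    // using the config taken from the RDD's own SparkContext
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    val wordsDataFrame = rdd.toDF("word")
    wordsDataFrame.createOrReplaceTempView("words")

    spark.sql("select word, count(*) as total from words group by word").show()
  }
}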
Could someone explain why the exception is thrown, and what the proper way to create the SparkSession is?