
Disclaimer: just starting to play with Spark.

I'm having trouble understanding the famous "Task not serializable" exception, but my question is a little different from those I see on SO (or so I think).

I have a tiny custom RDD (TestRDD). It has a field which stores an object whose class does not implement Serializable (NonSerializable). I've set the "spark.serializer" config option to use Kryo. However, when I try count() on my RDD, I get the following:

Caused by: java.io.NotSerializableException: com.test.spark.NonSerializable
Serialization stack:
- object not serializable (class: com.test.spark.NonSerializable, value: com.test.spark.NonSerializable@2901e052)
- field (class: com.test.spark.TestRDD, name: mNS, type: class com.test.spark.NonSerializable)
- object (class com.test.spark.TestRDD, TestRDD[1] at RDD at TestRDD.java:28)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (TestRDD[1] at RDD at TestRDD.java:28,<function2>))
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1009)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:933)

When I look inside DAGScheduler.submitMissingTasks I see that it uses its closure serializer on my RDD, which is the Java serializer, not the Kryo serializer which I'd expect. I've read that Kryo has issues serializing closures and that Spark always uses the Java serializer for closures, but I don't quite understand how closures come into play here at all. All I'm doing here is this:

SparkConf conf = new SparkConf()
                         .setAppName("ScanTest")
                         .setMaster("local")
                         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

JavaSparkContext sc = new JavaSparkContext(conf);

TestRDD rdd = new TestRDD(sc.sc());
System.err.println(rdd.count());

That is, no mappers or anything which would require serialization of closures. OTOH this works:

sc.parallelize(Arrays.asList(new NonSerializable(), new NonSerializable())).count()

The Kryo serializer is used as expected, and the closure serializer is not involved. If I didn't set the serializer property to Kryo, I'd get an exception here as well.
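(For completeness, here's roughly how the class could also be registered with Kryo. Registration isn't required for Kryo to handle an unregistered class; it only makes the serialized form smaller, so I've left it out above. The class array below just uses this example's NonSerializable.)

SparkConf conf = new SparkConf()
                         .setAppName("ScanTest")
                         .setMaster("local")
                         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                         // optional: lets Kryo write a compact class ID instead of the full class name
                         .registerKryoClasses(new Class<?>[] { NonSerializable.class });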

I appreciate any pointers explaining where the closure comes from and how to ensure that I can use Kryo to serialize custom RDDs.

UPDATE: here's TestRDD with its non-serializable field mNS:

import java.util.Arrays;
import java.util.Collections;

import org.apache.spark.Dependency;
import org.apache.spark.Partition;
import org.apache.spark.SparkContext;
import org.apache.spark.TaskContext;
import org.apache.spark.rdd.RDD;

import scala.collection.Iterator;
import scala.collection.JavaConversions;
import scala.collection.JavaConverters;
import scala.reflect.ClassManifestFactory$;
import scala.reflect.ClassTag;

class TestRDD extends RDD<String> {

    private static final ClassTag<String> STRING_TAG = ClassManifestFactory$.MODULE$.fromClass(String.class);

    NonSerializable mNS = new NonSerializable();

    public TestRDD(final SparkContext _sc) {
        super(_sc,
              JavaConversions.asScalaBuffer(Collections.<Dependency<?>>emptyList()),
              STRING_TAG);
    }

    @Override
    public Iterator<String> compute(final Partition thePartition, final TaskContext theTaskContext) {
        return JavaConverters.asScalaIteratorConverter(Arrays.asList("test_" + thePartition.index(),
                                                                     "test_" + thePartition.index(),
                                                                     "test_" + thePartition.index()).iterator()).asScala();
    }

    @Override
    public Partition[] getPartitions() {
        return new Partition[] {new TestPartition(0), new TestPartition(1), new TestPartition(2)};
    }

    static class TestPartition implements Partition {

        final int mIndex;

        public TestPartition(final int theIndex) {
            mIndex = theIndex;
        }

        public int index() {
            return mIndex;
        }
    }
}
Pavel Klinov

1 Answer


When I look inside DAGScheduler.submitMissingTasks I see that it uses its closure serializer on my RDD, which is the Java serializer, not the Kryo serializer which I'd expect.

SparkEnv supports two serializers. One, named serializer, is used for serialization of your data, checkpointing, messaging between workers, and so on, and is available under the spark.serializer configuration flag. The other, called closureSerializer (under spark.closure.serializer), is used to check that your object is in fact serializable and to serialize tasks; it is configurable for Spark <= 1.6.2 (but nothing other than JavaSerializer actually works there) and is hardcoded to JavaSerializer from 2.0.0 onwards.
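To make the distinction concrete, here is a sketch of how the two flags would be set side by side (the spark.closure.serializer line is only honored on Spark <= 1.6.2, and even there nothing other than JavaSerializer actually works):

SparkConf conf = new SparkConf()
        // data serializer: shuffle blocks, cached/checkpointed data, records sent between workers
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // closure/task serializer: ignored from 2.0.0 onwards, where JavaSerializer is hardcoded
        .set("spark.closure.serializer", "org.apache.spark.serializer.JavaSerializer");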

The Kryo closure serializer has a bug which makes it unusable; you can see that bug under SPARK-7708 (this may be fixed with Kryo 3.0.0, but Spark is currently pinned to a specific version of Chill, which is in turn pinned to Kryo 2.2.1). Further, for Spark 2.0.x the JavaSerializer is now fixed instead of configurable (you can see it in this pull request). This means that effectively we're stuck with the JavaSerializer for closure serialization.

Is it weird that we're using one serializer to submit tasks and another to serialize data between workers and such? Definitely, but this is what we have.

To sum up: if you're setting the spark.serializer configuration, or using SparkContext.registerKryoClasses, you'll be utilizing Kryo for most of your serialization in Spark. Having said that, for checking whether a given class is serializable and for serializing tasks to workers, Spark will use JavaSerializer.
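If you want to see both of them for yourself, here is a quick sketch (SparkEnv is an internal developer API, so treat this as a debugging aid rather than something to rely on):

import org.apache.spark.SparkEnv;

// run after the SparkContext has been created
SparkEnv env = SparkEnv.get();
System.err.println(env.serializer().getClass());        // KryoSerializer when spark.serializer is set to it
System.err.println(env.closureSerializer().getClass()); // JavaSerializer, regardless of configuration on 2.0+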

Yuval Itzchakov
  • Thanks, but how's it incorrect? I can see that `DAGScheduler` uses the `closureSerializer` field, not the `serializer` field. Regardless of whether I set the environment to use Kryo or not, `SparkEnv.get.closureSerializer` is always the Java serializer (iirc, they even pulled the option `spark.closure.serializer` out of 2.0 since it was ignored anyway), so I see why it fails. The question is different: why does the scheduler use the closure serializer in my case? How do I get it to use Kryo for my RDD? – Pavel Klinov Oct 26 '16 at 12:23
  • hm, I believe that this statement is correct for Spark 2.0.0 and 2.0.1 (as evident from the stack trace). You can also check SPARK-12414. `closureSerializer` may be of an abstract type, but AFAICT only one implementation is used. – Pavel Klinov Oct 26 '16 at 12:39
  • @PavelKlinov You're right. I dug a little deeper, see my update. – Yuval Itzchakov Oct 26 '16 at 13:04
  • OK, thanks. I have to say if you're correct, this sounds a bit strange. One of the reasons for Kryo (in addition to speed/size) is being able to deal with objects which aren't `Serializable` (and can't be made such). Kryo can indeed serialize them. But if Spark doesn't even get to the point where Kryo would serialize them, then what's the point... Also it seems like we use the term "closure" differently: I meant Scala closures and you seem to mean the transitive closure of objects reachable from the RDD. – Pavel Klinov Oct 26 '16 at 13:24
  • @PavelKlinov What do you mean by *"I meant Scala closures"*? Are you referring to fields lifted onto the function in order to be available at computation time? – Yuval Itzchakov Oct 26 '16 at 13:26
  • @PavelKlinov *But if Spark doesn't even get to the point where Kryo would serialize them, then what's the point.* Kryo is being used for serialization; it simply isn't being used to check that an object is validly serializable. I agree that this is a quirk. If we're using Kryo to serialize a class between workers, then why validate it according to JavaSerializer rules? – Yuval Itzchakov Oct 26 '16 at 13:34
  • Yeah, I thought `closureSerializer` is needed to serialize closures -- functions whose return value depends on variables declared outside their scope, e.g. when one passes such things to, say, `map(..)`. Perhaps that's wrong. Anyway, assuming you're right and it's used to check serializability, do you have a suggestion for how to make my example work without marking `NonSerializable` as `Serializable`? I can only think of creating some serializable wrapper around it... – Pavel Klinov Oct 26 '16 at 13:36
  • 2
  • @PavelKlinov If you have any property which isn't serializable, a common practice is to mark it as `@transient` and have the workers load it lazily (see the sketch after these comments). – Yuval Itzchakov Oct 26 '16 at 13:42
  • I am still facing this issue in 2023. Is there a suggested strategy to avoid such issues ? – Deepak Patankar Apr 20 '23 at 09:27
  • @DeepakPatankar Unfortunately, same strategies apply :( – Yuval Itzchakov Apr 20 '23 at 10:38