When I run my Dataflow pipeline, I get the exception below complaining that my DoFn can't be serialized. How do I fix this?

Here's the stack trace:

Caused by: java.lang.IllegalArgumentException: unable to serialize contrail.dataflow.AvroMRTransforms$AvroReducerDoFn@bba0fc2
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:51)
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.ensureSerializable(SerializableUtils.java:81)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.ensureSerializable(DirectPipelineRunner.java:784)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateHelper(ParDo.java:1025)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.evaluateSingleHelper(ParDo.java:963)
    at com.google.cloud.dataflow.sdk.transforms.ParDo.access$000(ParDo.java:441)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:951)
    at com.google.cloud.dataflow.sdk.transforms.ParDo$1.evaluate(ParDo.java:946)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:611)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
    at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
    at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
    at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:584)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:328)
    at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
    at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:145)
    at contrail.stages.DataflowStage.stageMain(DataflowStage.java:51)
    at contrail.stages.NonMRStage.execute(NonMRStage.java:130)
    at contrail.stages.NonMRStage.run(NonMRStage.java:157)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at contrail.stages.ValidateGraphDataflow.main(ValidateGraphDataflow.java:139)
    ... 6 more
Caused by: java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:47)
    ... 27 more
Jeremy

2 Answers

To add to what Jeremy says...

Another common cause of serialization issues is using an anonymous DoFn in a non-static context. An anonymous inner class holds an implicit reference to its enclosing instance, so that instance gets serialized along with the DoFn. If the enclosing class isn't Serializable, serialization fails.
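
To make this concrete, here is a minimal sketch in plain Java, with no Dataflow SDK involved. The names `Fn`, `Stage`, `SuffixFn`, and `canSerialize` are hypothetical stand-ins (`Fn` plays the role of a DoFn), and the anonymous class reads a field of the enclosing object so the captured reference is explicit.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class AnonymousCaptureDemo {

    /** Hypothetical stand-in for a DoFn: a function object that must be Serializable. */
    public interface Fn extends Serializable {
        String apply(String input);
    }

    /** Hypothetical enclosing class. Note that it is NOT Serializable. */
    public static class Stage {
        private final String suffix;

        public Stage(String suffix) {
            this.suffix = suffix;
        }

        // Created in an instance method, so the anonymous class carries a hidden
        // reference to this Stage (it needs one here to read `suffix`).
        public Fn anonymousFn() {
            return new Fn() {
                @Override
                public String apply(String input) {
                    return input + suffix;
                }
            };
        }

        // A static nested class has no enclosing instance; it gets its own copy of the value.
        public Fn staticFn() {
            return new SuffixFn(suffix);
        }
    }

    /** Static nested replacement for the anonymous class. */
    public static class SuffixFn implements Fn {
        private static final long serialVersionUID = 1L;
        private final String suffix;

        public SuffixFn(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public String apply(String input) {
            return input + suffix;
        }
    }

    /** Returns true if Java serialization accepts the object, false on NotSerializableException. */
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Stage stage = new Stage("!");
        System.out.println("anonymous: " + canSerialize(stage.anonymousFn()));
        System.out.println("static nested: " + canSerialize(stage.staticFn()));
    }
}
```

Moving the DoFn into a static nested (or top-level) class and passing in only the values it needs is usually enough to keep the enclosing object out of the serialized form.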

Frances
  • Ok... This doesn't make any sense to me, but indeed, when I change the DoFn from an anonymous class to a real class, the problem goes away. In my case, I use Kotlin rather than Java. – marcoseu May 16 '19 at 18:53
If you scroll through the stack trace, one of the causes clearly identifies the class that isn't serializable.

Caused by: java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf

The problem was that my DoFn took a JobConf instance in its constructor and stored it in an instance variable. I had assumed JobConf was serializable, but it turns out it isn't.

To solve this, I did the following:

  • I marked the JobConf member variable as transient so that it wouldn't be serialized.
  • I created a separate variable of type byte[] to store a serialized version of the JobConf.
  • In my constructor, I serialized the JobConf to a byte[] and stored it in that instance variable.
  • I overrode startBundle and deserialized the JobConf from the byte[].

Here's a gist with my DoFn.
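
For readers without the gist, the steps above can be sketched roughly as follows. This is not the code from the gist. `FakeConf` is a hypothetical stand-in for JobConf (non-serializable, but able to write itself to a stream, much like Hadoop's Writable `write`/`readFields`), and `ConfFn` is a hypothetical stand-in for the DoFn, so the example runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class TransientConfigDemo {

    /** Hypothetical stand-in for JobConf: NOT Serializable, but can write/read itself. */
    public static class FakeConf {
        private String value = "";
        public FakeConf() {}
        public FakeConf(String value) { this.value = value; }
        public String get() { return value; }
        public void write(DataOutput out) throws IOException { out.writeUTF(value); }
        public void readFields(DataInput in) throws IOException { value = in.readUTF(); }
    }

    /** Hypothetical stand-in for the DoFn. */
    public static class ConfFn implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient FakeConf conf;   // skipped by Java serialization
        private final byte[] confBytes;    // serialized form that travels with the object

        public ConfFn(FakeConf conf) {
            this.conf = conf;
            this.confBytes = toBytes(conf);
        }

        // Plays the role of startBundle: rebuild the config once, not per element.
        public void startBundle() {
            if (conf == null) {
                conf = fromBytes(confBytes);
            }
        }

        public String processElement(String element) {
            return element + ":" + conf.get();
        }
    }

    static byte[] toBytes(FakeConf conf) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            conf.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static FakeConf fromBytes(byte[] bytes) {
        FakeConf conf = new FakeConf();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            conf.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conf;
    }

    /** Simulates shipping the function to a worker via Java serialization. */
    public static ConfFn roundTrip(ConfFn fn) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(fn);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (ConfFn) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ConfFn shipped = roundTrip(new ConfFn(new FakeConf("hello")));
        shipped.startBundle();
        System.out.println(shipped.processElement("a"));
    }
}
```

After the round trip the transient field is null, so the null check in `startBundle` is what restores the configuration on the receiving side.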

Jeremy
  • This solution is quite close to the one I found. My problem is serializing the Avro Schema through the DoFn. The approach I found is to pass the Schema string to the function class's constructor and then parse it in the processElement() method. This approach parses the schema for every element of the PCollection being transformed, which slows down performance. I was wondering whether your solution behaves the same way, or whether it does the deserialization/parsing just once, since you do it in the startBundle() method. The Javadoc doesn't specify when it gets executed. Thank you – Giuseppe Adaldo Nov 08 '16 at 17:25
  • How did you implement the `serializeJobConf()` method? – Kakaji Jun 16 '17 at 08:46