
I have a ParDo DoFn that deserializes protobuf messages from a byte stream. It outputs a PCollection<InputMessage>, where InputMessage is a generic type parameter (<? extends Message>). It obtains the Google protobuf Descriptor through a wacky method that circumvents type erasure:

Type mySuperclass = this.getClass().getGenericSuperclass();
Type inputType = ((ParameterizedType) mySuperclass).getActualTypeArguments()[0];
Class<InputT> inputTClass = (Class<InputT>) inputType;

Descriptor descriptor = MessageReflector.getDescriptor(inputTClass);

and uses the descriptor to deserialize the protobuf from the bytestream. All this is working fine.
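
For readers unfamiliar with the trick, here is a minimal standalone sketch of reading a type argument back from the generic superclass. It uses only the JDK; TypedParser, StringParser and TypeTokenDemo are hypothetical stand-ins and not Beam or protobuf types. It assumes the subclass binds the parameter to a concrete class directly.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

class TypeTokenDemo {
    // Hypothetical stand-in for a generic DoFn base class (not a Beam type).
    abstract static class TypedParser<T> {
        private final Class<T> type;

        @SuppressWarnings("unchecked")
        TypedParser() {
            // "extends TypedParser<String>" is recorded in the subclass metadata,
            // so the type argument survives erasure and can be read reflectively.
            Type superclass = getClass().getGenericSuperclass();
            if (!(superclass instanceof ParameterizedType)) {
                throw new IllegalStateException("Subclass must bind T directly");
            }
            Type arg = ((ParameterizedType) superclass).getActualTypeArguments()[0];
            if (!(arg instanceof Class)) {
                throw new IllegalStateException("T must be a concrete class");
            }
            this.type = (Class<T>) arg;
        }

        Class<T> type() {
            return type;
        }
    }

    // Concrete subclass that binds T to String.
    static class StringParser extends TypedParser<String> {}

    static String resolvedTypeName() {
        return new StringParser().type().getName();
    }

    public static void main(String[] args) {
        System.out.println(resolvedTypeName()); // prints java.lang.String
    }
}
```

This only works when the concrete type is fixed in the class declaration at compile time, which is why it does not help when the class name is known only at runtime.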

What doesn't work is a similar DoFn that doesn't take the proto message type as a generic type parameter. Instead, it learns the proto message class name only at runtime, uses Class.forName to get the Class, and then gets the Descriptor and uses it to deserialize the protobuf. This DoFn outputs com.google.protobuf.Message, the common supertype of all protobuf message classes, and its callers receive a PCollection<Message> as the output. When I run this transformation, it fails:

Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize DoFnWithExecutionInformation{doFn=com.google.cloud.verticals.telco.taap.dataflow.dataingestion.businesslogic.datastrategy.ProcessExtractedDataFn@72e572ee, mainOutputTag=Tag<com.google.cloud.verticals.telco.taap.dataflow.dataingestion.businesslogic.BusinessLogicPipeline#1>, sideInputMapping={}, schemaInformation=DoFnSchemaInformation{elementConverters=[]}}
    at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:59)
    at org.apache.beam.runners.core.construction.ParDoTranslation.translateDoFn(ParDoTranslation.java:695)
    at org.apache.beam.runners.core.construction.ParDoTranslation$1.translateDoFn(ParDoTranslation.java:253)
    at org.apache.beam.runners.core.construction.ParDoTranslation.payloadForParDoLike(ParDoTranslation.java:817)
    at org.apache.beam.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:249)
    at org.apache.beam.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:210)
    at org.apache.beam.runners.core.construction.ParDoTranslation$ParDoTranslator.translate(ParDoTranslation.java:176)
    at org.apache.beam.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:248)
    at org.apache.beam.runners.core.construction.ParDoTranslation.getParDoPayload(ParDoTranslation.java:746)
    at org.apache.beam.runners.core.construction.ParDoTranslation.isSplittable(ParDoTranslation.java:761)
    at org.apache.beam.runners.core.construction.PTransformMatchers$6.matches(PTransformMatchers.java:274)
    at org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform(Pipeline.java:290)
    at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:595)
    at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:587)
    at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:214)
    at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:469)
    at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:268)
    at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:218)
    at org.apache.beam.runners.direct.DirectRunner.performRewrites(DirectRunner.java:254)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:175)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
    at . . .

Caused by: java.io.NotSerializableException: com.google.protobuf.Descriptors$Descriptor
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1185)
    at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
    at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
    at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
    at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
    at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
    at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
    at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
    at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
    at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
    at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
    at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:55)

Why doesn't it work?

Walking Corpse
  • could you please share the DoFn code that you've tried? Just so it's clear to me where it's trying to serialize this Descriptor class – Pablo Jun 14 '22 at 20:22
  • That is the puzzling part: nowhere does my code try to serialize the Descriptor. It only uses it to deserialize the proto byte stream. Just the presence of the Descriptor in the code seems to cause this problem. The breakpoint in the DoFn's processMessage never gets hit. The error is thrown as soon as I call pipeline.run(). In fact, the top exception message says it failed to serialize the DoFn, and only the inner exception says it failed to serialize the Descriptor. So it is trying to serialize "code" to give it to the workers. – Walking Corpse Jun 14 '22 at 20:39
  • Right, I understand the confusion. It may be a bug, but I'd need to understand how the DoFn is defined to be able to say more. Have you tried running it on a remote runner? (flink/spark/dataflow?) – Pablo Jun 14 '22 at 20:42
  • can you share the DoFn code? – Pablo Jun 15 '22 at 20:39
  • As I was working on producing a repro, I came across https://stackoverflow.com/questions/28032063/how-to-fix-dataflow-unable-to-serialize-my-dofn which gives the solution. I just had to move the non-serializable types out of the DoFn's member variables and into processMessage as local variables. – Walking Corpse Jun 16 '22 at 09:17
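
The mechanism described in these comments can be reproduced without Beam. The sketch below is JDK-only and uses hypothetical stand-ins: FakeDescriptor plays the role of the non-serializable Descriptor, EagerFn mirrors a DoFn that holds it as an ordinary member, and LazyFn keeps only the class name in its serialized state and rebuilds the helper on first use. In a real Beam DoFn the rebuild would more likely live inside the processing method (as in the fix above) or in a method annotated with @Setup; that placement is an assumption here, not something shown in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class TransientFieldDemo {
    // Hypothetical stand-in for the protobuf Descriptor: deliberately not Serializable.
    static final class FakeDescriptor {
        final String fullName;
        FakeDescriptor(String fullName) { this.fullName = fullName; }
    }

    // Mirrors the failing DoFn: the non-serializable helper is a regular field.
    static final class EagerFn implements Serializable {
        private static final long serialVersionUID = 1L;
        private final FakeDescriptor descriptor;
        EagerFn(String className) { this.descriptor = new FakeDescriptor(className); }
    }

    // Serialized state is just a String; the helper is transient and rebuilt lazily.
    static final class LazyFn implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String className;
        private transient FakeDescriptor descriptor; // skipped by Java serialization

        LazyFn(String className) { this.className = className; }

        FakeDescriptor descriptor() {
            if (descriptor == null) {
                descriptor = new FakeDescriptor(className);
            }
            return descriptor;
        }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Returns true when Java serialization rejects the object graph.
    static boolean failsToSerialize(Object o) {
        try {
            serialize(o);
            return false;
        } catch (NotSerializableException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serializes a LazyFn, restores it, and rebuilds the helper on the restored copy.
    static String roundTripName(String className) {
        try {
            byte[] data = serialize(new LazyFn(className));
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                return ((LazyFn) in.readObject()).descriptor().fullName;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(failsToSerialize(new EagerFn("pkg.MyMessage"))); // prints true
        System.out.println(roundTripName("pkg.MyMessage")); // prints pkg.MyMessage
    }
}
```

The failure happens at pipeline.run() rather than in the processing method because the whole DoFn instance, including every non-transient field, is serialized before any element is processed.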
