
I want to join two PCollections, each read from a different input, and I am following the steps described in the "Joins with CoGroupByKey" section here: https://cloud.google.com/dataflow/model/group-by-key

In my case, I want to join GeoIP "block" information with "location" information. I defined Block and Location as custom classes and then wrote the following:

final TupleTag<Block> t1 = new TupleTag<Block>();
final TupleTag<Location> t2 = new TupleTag<Location>();
PCollection<KV<Long, CoGbkResult>> coGbkResultColl = KeyedPCollectionTuple.of(t1, kvGeoNameIDBlock)
        .and(t2, kvGeoNameIDLocation).apply(CoGroupByKey.<Long>create());

The key is of type Long. I thought I was done, but when I run mvn compile, it outputs the following error:

[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project xxxx: An exception occured while executing the Java class. null: InvocationTargetException: Unable to return a default Coder for Extract GeoNameID-Block KV/ParMultiDo(ExtractGeoNameIDBlock).out0 [PCollection]. Correct one of the following root causes:
[ERROR]   No Coder has been manually specified;  you may do so using .setCoder().
[ERROR]   Inferring a Coder from the CoderRegistry failed: Cannot provide coder for parameterized type org.apache.beam.sdk.values.KV<java.lang.Long, com.xxx.platform.geoip2.Block>: Unable to provide a Coder for com.xxx.platform.geoip2.Block.
[ERROR]   Building a Coder using a registered CoderProvider failed.
[ERROR]   See suppressed exceptions for detailed failures.
[ERROR]   Using the default output Coder from the producing PTransform failed: Cannot provide coder for parameterized type org.apache.beam.sdk.values.KV<java.lang.Long, com.xxx.platform.geoip2.Block>: Unable to provide a Coder for com.xxx.platform.geoip2.Block.

The DoFn that produces the error is ExtractGeoNameIDBlock, which simply creates a key-value pair of the join key and the Block itself.

// ExtractGeoNameIDBlock creates KV collection while reading from block CSV
static class ExtractGeoNameIDBlock extends DoFn<String, KV<Long, Block>> {
  private static final long serialVersionUID = 1L;

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    String line = c.element();

    if (!line.startsWith("network,")) { // exclude headerline
      Block b = new Block();
      b.loadFromCsvLine(line);

      if (b.getGeonameId() != null) {
        c.output(KV.of(b.getGeonameId(), b));
      }
    }
  }
}

loadFromCsvLine just parses a CSV line, converts each field to its corresponding type, and assigns it to the matching private field.
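For illustration, here is a minimal plain-Java sketch of what such a method might look like. It is not the actual implementation: the class name BlockCsvSketch, the helper methods, the four fields shown, and the column order are all assumptions made for this example. The point it shows is that an empty CSV column ends up as a null field, which matters for whichever coder is used later.

```java
final class BlockCsvSketch {

    /** Hypothetical stand-in for the Block class; fields and column order are assumed. */
    static final class Block {
        String network;
        Long geonameId;
        Long registeredCountryGeonameId;
        Long representedCountryGeonameId;

        void loadFromCsvLine(String line) {
            // A limit of -1 keeps trailing empty columns.
            // Note: a naive split does not handle quoted commas.
            String[] fields = line.split(",", -1);
            network = emptyToNull(field(fields, 0));
            geonameId = parseNullableLong(field(fields, 1));
            registeredCountryGeonameId = parseNullableLong(field(fields, 2));
            representedCountryGeonameId = parseNullableLong(field(fields, 3));
        }
    }

    /** Returns the trimmed column, or "" when the line has fewer columns. */
    static String field(String[] fields, int index) {
        return index < fields.length ? fields[index].trim() : "";
    }

    static String emptyToNull(String value) {
        return value.isEmpty() ? null : value;
    }

    /** Empty columns become null instead of causing a NumberFormatException. */
    static Long parseNullableLong(String value) {
        return value.isEmpty() ? null : Long.valueOf(value);
    }

    public static void main(String[] args) {
        Block b = new Block();
        b.loadFromCsvLine("1.0.0.0/24,2077456,2077456,,0,0");
        System.out.println(b.geonameId + " " + b.representedCountryGeonameId);
    }
}
```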

So it looks like I need to set a coder for my custom class to make this work. I found a document about coders, but I am still not sure how to implement my own: https://cloud.google.com/dataflow/model/data-encoding

Is there a real example I can follow to create a custom coder for my custom class?

[Update 13:02 09/26/2017] I added

CoderRegistry cr = p.getCoderRegistry();
cr.registerCoderForClass(Block.class, AvroCoder.of(Block.class));

And then I got this error:

 java.lang.NullPointerException: in com.xxx.platform.geoip2.Block in long null of long in field representedCountryGeonameId of com.xxx.platform.geoip2.Block

[Update 14:05 09/26/2017] I changed the implementation like this:

@DefaultCoder(AvroCoder.class)
public class Block {
    private static final Logger LOG = LoggerFactory.getLogger(Block.class);

    @Nullable
    public String network;
    @Nullable
    public Long registeredCountryGeonameId;
:
:

(I annotated every field with @Nullable.)

But I still got this error:

(22eeaf3dfb26f8cc): java.lang.RuntimeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null Long
    at com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:191)
    at org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:211)
    at org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:66)
    at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:436)
    at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:424)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey$ConstructUnionTableFn.processElement(CoGroupByKey.java:185)
Caused by: org.apache.beam.sdk.coders.CoderException: cannot encode a null Long
    at org.apache.beam.sdk.coders.VarLongCoder.encode(VarLongCoder.java:51)
    at org.apache.beam.sdk.coders.VarLongCoder.encode(VarLongCoder.java:35)
    at org.apache.beam.sdk.coders.Coder.encode(Coder.java:135)
    at com.google.cloud.dataflow.worker.ShuffleSink$ShuffleSinkWriter.encodeToChunk(ShuffleSink.java:320)
    at com.google.cloud.dataflow.worker.ShuffleSink$ShuffleSinkWriter.add(ShuffleSink.java:216)
    at com.google.cloud.dataflow.worker.ShuffleSink$ShuffleSinkWriter.add(ShuffleSink.java:178)
    at com.google.cloud.dataflow.worker.util.common.worker.WriteOperation.process(WriteOperation.java:80)
    at com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
    at com.google.cloud.dataflow.worker.ReifyTimestampAndWindowsParDoFnFactory$ReifyTimestampAndWindowsParDoFn.processElement(ReifyTimestampAndWindowsParDoFnFactory.java:68)
    at com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:48)
    at com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
    at com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:183)
    at org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:211)
    at org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:66)
    at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:436)
    at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:424)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey$ConstructUnionTableFn.processElement(CoGroupByKey.java:185)
    at org.apache.beam.sdk.transforms.join.CoGroupByKey$ConstructUnionTableFn$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
    at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
    at com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:233)
    at com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:48)
    at com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
    at com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:183)
    at org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:211)
    at org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:66)
    at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:436)
    at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:424)
    at com.bandainamcoent.platform.GeoIpPopulateTable$ExtractGeoNameIDBlock.processElement(GeoIpPopulateTable.java:79)
    at com.bandainamcoent.platform.GeoIpPopulateTable$ExtractGeoNameIDBlock$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
    at org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
    at com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:233)
    at com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:48)
    at com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
    at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:187)
    at com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:148)
    at com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:68)
    at com.google.cloud.dataflow.worker.DataflowWorker.executeWork(DataflowWorker.java:336)
    at com.google.cloud.dataflow.worker.DataflowWorker.doWork(DataflowWorker.java:294)
    at com.google.cloud.dataflow.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:244)
    at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
    at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
    at com.google.cloud.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Thanks.

Norio Akagi
  • Can you share more of the pipeline? The error suggests the problem is in a ParDo called "ExtractGeoNameIDBlock" not the CoGroupByKey. – Ben Chambers Sep 26 '17 at 16:55
  • Thanks @BenChambers, I added the code. In any case, I think I need to add a Coder for my custom class whenever I use it in a DoFn, because at some point the pipeline may write the data out to a file, so it needs to be encoded and decoded. Is my understanding correct? – Norio Akagi Sep 26 '17 at 17:40

2 Answers


It looks like your custom class Block doesn't have a coder specified. You can create your own Coder, or use one of the general ones such as AvroCoder. You should also register it with the CoderRegistry so the pipeline knows how to encode Blocks.
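As a rough idea of what "creating your own Coder" involves, here is a minimal plain-Java sketch of null-safe encode and decode logic. It deliberately uses only the JDK so it runs on its own. In Beam, logic like this would typically sit in the encode and decode methods of a Coder subclass (for example one extending CustomCoder<Block>) that is then registered with the CoderRegistry; that wiring is omitted here. The class name BlockEncodingSketch and the two-field Block are assumptions for illustration, not the asker's real class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BlockEncodingSketch {

    /** Hypothetical, trimmed-down stand-in for the Block class. */
    static final class Block {
        String network;                  // may be null
        Long registeredCountryGeonameId; // may be null
    }

    /** Writes each field behind a presence flag so null values can be encoded. */
    static void encode(Block block, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeBoolean(block.network != null);
        if (block.network != null) {
            data.writeUTF(block.network);
        }
        data.writeBoolean(block.registeredCountryGeonameId != null);
        if (block.registeredCountryGeonameId != null) {
            data.writeLong(block.registeredCountryGeonameId);
        }
        data.flush();
    }

    /** Reads the fields back in the same order, honoring the presence flags. */
    static Block decode(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        Block block = new Block();
        if (data.readBoolean()) {
            block.network = data.readUTF();
        }
        if (data.readBoolean()) {
            block.registeredCountryGeonameId = data.readLong();
        }
        return block;
    }

    public static void main(String[] args) throws IOException {
        Block original = new Block();
        original.network = "1.0.0.0/24"; // registeredCountryGeonameId stays null

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        encode(original, bytes);
        Block copy = decode(new ByteArrayInputStream(bytes.toByteArray()));

        System.out.println(copy.network + " " + copy.registeredCountryGeonameId);
    }
}
```

A general-purpose coder such as AvroCoder saves you from writing this by hand, at the cost of having to tell it which fields may be null.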

Ben Chambers
  • Thank you for the post! I updated my question. After I specify AvroCoder it outputs a NullPointerException for some field which may possibly be NULL. Is there any way to tell AvroCoder explicitly that some fields are NULLable? – Norio Akagi Sep 26 '17 at 20:06
  • Ah, maybe this is relevant: https://stackoverflow.com/a/33443609/2543803 Let me try this first. – Norio Akagi Sep 26 '17 at 20:13
  • Hi, I could successfully run my pipeline using AvroCoder! Thank you so much for the help :-) – Norio Akagi Sep 26 '17 at 23:32

I finally got it working by using AvroCoder with @Nullable annotations, as described in the 14:05 09/26/2017 update to my question.

The last error I saw occurred simply because my data actually contained a null value that I did not expect. After I handled the null value in my Java code, everything worked fine.
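To illustrate the kind of null handling meant here, below is a minimal plain-Java sketch, not the actual pipeline code. It pairs each record with its join key and skips records whose key is null, since a coder for plain Long keys (VarLongCoder in the stack trace) cannot encode null. The class NullKeyGuard, its trimmed-down Block, and the method keyByGeonameId are hypothetical names; in the real pipeline the same guard would sit in the DoFn before the KV is output.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class NullKeyGuard {

    /** Hypothetical, trimmed-down stand-in for the Block class. */
    static final class Block {
        final String network;
        final Long geonameId; // may be null in the input data

        Block(String network, Long geonameId) {
            this.network = network;
            this.geonameId = geonameId;
        }
    }

    /** Pairs each block with its join key, dropping blocks that have no key. */
    static List<Map.Entry<Long, Block>> keyByGeonameId(List<Block> blocks) {
        List<Map.Entry<Long, Block>> keyed = new ArrayList<>();
        for (Block block : blocks) {
            if (block.geonameId == null) {
                continue; // a null key cannot be encoded, so skip the record
            }
            keyed.add(new AbstractMap.SimpleImmutableEntry<>(block.geonameId, block));
        }
        return keyed;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block("1.0.0.0/24", 2077456L),
                new Block("1.0.1.0/24", null));
        for (Map.Entry<Long, Block> entry : keyByGeonameId(blocks)) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().network);
        }
    }
}
```

Whether to drop such records or map them to a sentinel key depends on what the join is supposed to produce for blocks without a location.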

I think this post on another question is very useful for this problem: https://stackoverflow.com/a/32342403/2543803

Norio Akagi