
I'm trying to deduplicate input messages from Google Cloud Pub/Sub using Apache Beam's Deduplicate transform. However, I run into an error after creating KV<String, MyModel> pairs and passing them to the Deduplicate transform.

Error:

ParDo requires a deterministic key coder in order to use state and timers

Code:

PCollection<KV<String, MyModel>> deduplicatedEvents =
    messages
        .apply(
            "CreateKVPairs",
            ParDo.of(
                new DoFn<MyModel, KV<String, MyModel>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    c.output(KV.of(c.element().getUniqueKey(), c.element()));
                  }
                }))
        .apply(
            "Deduplicate",
            Deduplicate.<KV<String, MyModel>>values());

How should I create a deterministic coder that can encode/decode the String key so that this works?

Any input would be really helpful.

  • Hey Kyle, does this help you out? https://stackoverflow.com/questions/57208405/how-to-add-de-duplication-to-a-streaming-pipeline-apache-beam – Cubez Aug 03 '20 at 18:17
  • Did you try using KeyedValues? https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/Deduplicate.KeyedValues.html – rmesteves Aug 06 '20 at 15:31
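
As a rough, untested sketch of the Deduplicate.KeyedValues suggestion from the comment above (not something tried in the question): "keyedMessages" below is an assumed name standing in for the output of the "CreateKVPairs" step, and deduplication happens by the String key alone, which Beam encodes with the deterministic StringUtf8Coder.

// Sketch only: "keyedMessages" is assumed to be the KV<String, MyModel>
// output of the "CreateKVPairs" step above. keyedValues() deduplicates by
// the String key, which uses the deterministic StringUtf8Coder.
PCollection<KV<String, MyModel>> deduplicatedEvents =
    keyedMessages.apply(
        "DeduplicateByKey",
        Deduplicate.<String, MyModel>keyedValues());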

1 Answer


The Deduplicate transform works by putting the whole element into the key and then doing a key grouping operation (in this case a stateful ParDo). Because Beam is language-independent, grouping by key is done using the encoded form of elements. Two elements that encode to the same bytes are "equal" while two elements that encode to different bytes are "unequal".

A deterministic coder is a concept about how equality in a language (like Java) relates to Beam equality: if two Java objects are equal according to Java's equals(), then they must have the same encoded bytes. For simple data like strings, numbers, and arrays, this is easy. It is helpful to think about what makes a coder non-deterministic. For example, two Map instances may be equals() at the Java level, but their key-value pairs may be encoded in different orders, making them unequal for Beam.
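
To make that concrete, here is a minimal sketch (not from the original answer) using Beam's Coder.verifyDeterministic(), which throws when a coder cannot guarantee byte-for-byte stable encodings:

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.MapCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;

public class DeterminismCheck {
  public static void main(String[] args) throws Exception {
    // StringUtf8Coder is deterministic: equal strings always encode to the same bytes.
    StringUtf8Coder.of().verifyDeterministic();

    // MapCoder is not: two equals() maps may write their entries in different orders.
    try {
      MapCoder.of(StringUtf8Coder.of(), VarLongCoder.of()).verifyDeterministic();
    } catch (Coder.NonDeterministicException e) {
      System.out.println("MapCoder rejected: " + e.getMessage());
    }
  }
}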

If you have a non-deterministic coder for MyModel, then Deduplicate will not work correctly: you will end up with duplicates, because Beam considers the differently encoded objects to be unequal.

Probably the easiest way to automatically get a high quality deterministic coder is to leverage Beam's schema inference: https://beam.apache.org/documentation/programming-guide/#schemas-for-pl-types. You will need to ensure that all the fields can also be encoded deterministically.
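
For illustration, a minimal sketch of what that could look like (the field names here are assumptions, not from the question):

import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Sketch only: annotating the model lets Beam infer a schema and derive a
// coder from it. Keep the fields to types that themselves encode
// deterministically (strings, numbers, etc.).
@DefaultSchema(JavaFieldSchema.class)
public class MyModel {
  public String uniqueKey;
  public String payload;
  public long eventTimeMillis;

  public String getUniqueKey() {
    return uniqueKey;
  }
}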

Kenn Knowles
  • Thanks, this was very helpful. For my use case, I ended up changing the design of the pipeline so that I don't need to de-duplicate the custom model, only strings. Nonetheless, your answer puts things in perspective; now I know what the approach should be if I ever have to de-duplicate custom models. – kylebutters Aug 14 '20 at 06:08