
While I was hacking together a quick CSV-to-Firebase upload I did just that instead of writing a custom sink. This is an oversimplification of the code:

public static void main(String[] args) throws Exception {

    Options options = PipelineOptionsFactory.as(Options.class);
    Pipeline p = Pipeline.create(options);

    PCollection<String> CsvData = p.apply(TextIO.Read.from("/my_file.csv"));
    CsvData.apply(ParDo.named("Firebase").of(new DoFn<String, Void>() {
          @Override
          public void processElement(ProcessContext c) {
              Firebase fb = new Firebase("https://MYAPP.firebaseio.com/");
              fb.child("someId").setValue(c.element());
          }
        }));

    p.run();
}

It works. Is this the right place to consume a REST API on Cloud Dataflow?

Caio Iglesias
1 Answer


Yes, this should work, assuming that you're okay with the following caveat: the bundle may be replicated or retried multiple times in case of failures, i.e. your processElement call may be invoked on the same element multiple times, possibly concurrently.

Even though Dataflow will deduplicate the results (i.e. only one successful call's items emitted via c.output() will end up in the resulting PCollection), deduplicating the side effects (such as making an external API call) is the responsibility of your code.
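For example, here is a minimal sketch of making the side effect idempotent (assuming, purely for illustration, that each CSV line starts with a stable unique id in its first column, and reusing the CsvData collection and Firebase URL from the question). Because the key is derived deterministically from the element, a retried bundle overwrites the same location instead of creating a duplicate:

    CsvData.apply(ParDo.named("IdempotentFirebaseWrite").of(new DoFn<String, Void>() {
          @Override
          public void processElement(ProcessContext c) {
              // Hypothetical: the first CSV field is a stable unique id for the row.
              String id = c.element().split(",")[0];
              Firebase fb = new Firebase("https://MYAPP.firebaseio.com/");
              // Writing under a deterministic key means a retry overwrites rather than duplicates.
              fb.child(id).setValue(c.element());
          }
        }));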

The custom sink API merely makes these concerns explicit and provides one "pattern" for dealing with them: it gives each bundle a unique id and provides a hook for committing the successful results (e.g. file-based sinks have each bundle write to a uniquely-named temporary file, and the commit hook renames the files written by successfully completed bundles to their final location). If your use case is not sensitive to these concerns, you can perfectly well use a simple ParDo.
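To make that temp-file-and-rename pattern concrete, here is a rough illustration only (not the actual custom sink API, and using local paths, so it would behave correctly only on a single-worker/direct runner; a real sink writes to a shared location such as GCS). Each bundle writes to a uniquely-named temporary file and emits the file name as its result; since Dataflow deduplicates emitted results, only one successful attempt's file name reaches the downstream step that renames it into place:

    PCollection<String> tempFiles = CsvData.apply(ParDo.named("WriteTemp").of(new DoFn<String, String>() {
          private transient java.io.BufferedWriter writer;
          private transient String tempPath;

          @Override
          public void startBundle(Context c) throws Exception {
              // Unique name per bundle attempt; illustrative local path only.
              tempPath = "/tmp/out-" + java.util.UUID.randomUUID();
              writer = new java.io.BufferedWriter(new java.io.FileWriter(tempPath));
          }

          @Override
          public void processElement(ProcessContext c) throws Exception {
              writer.write(c.element());
              writer.newLine();
          }

          @Override
          public void finishBundle(Context c) throws Exception {
              writer.close();
              c.output(tempPath);  // only results of successful bundles survive deduplication
          }
        }));

    tempFiles.apply(ParDo.named("Commit").of(new DoFn<String, Void>() {
          @Override
          public void processElement(ProcessContext c) throws Exception {
              // "Commit hook": move the successful bundle's file to its final location.
              java.nio.file.Files.move(
                  java.nio.file.Paths.get(c.element()),
                  java.nio.file.Paths.get(c.element() + ".committed"));
          }
        }));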

Additionally, note that Dataflow doesn't yet have a custom sink API for streaming, so if this were a streaming pipeline, then a ParDo would definitely be the right option.

In your ParDo, you may want to batch the calls to Firebase to avoid the per-call overhead. You can do that using DoFn.finishBundle(): maintain the pending updates in a buffer, append to it in processElement, flush it whenever it grows too large, and flush one final time in finishBundle. See an example of a similar pattern in this answer.
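A rough sketch of that batching pattern (assuming the legacy Firebase Java client's updateChildren(Map) multi-location update, the same hypothetical first-field id as above, an arbitrary batch size, and java.util.Map/HashMap imports):

    CsvData.apply(ParDo.named("BatchedFirebase").of(new DoFn<String, Void>() {
          private static final int MAX_BATCH_SIZE = 500;  // arbitrary limit
          private transient Firebase fb;
          private transient Map<String, Object> batch;

          @Override
          public void startBundle(Context c) {
              fb = new Firebase("https://MYAPP.firebaseio.com/");
              batch = new HashMap<>();
          }

          @Override
          public void processElement(ProcessContext c) {
              // Hypothetical: key each update by the row's first CSV field.
              batch.put(c.element().split(",")[0], c.element());
              if (batch.size() >= MAX_BATCH_SIZE) {
                  flush();
              }
          }

          @Override
          public void finishBundle(Context c) {
              flush();  // one final flush for whatever is left in the buffer
          }

          private void flush() {
              if (!batch.isEmpty()) {
                  fb.updateChildren(batch);  // single multi-location write for the whole batch
                  batch = new HashMap<>();
              }
          }
        }));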

jkff
  • Excellent! Thank you for the thorough explanation! – Caio Iglesias Feb 24 '16 at 23:21
  • I'll keep a unique id for every item streamed to avoid duplicates, so it will just overwrite itself when invoked multiple times. – Caio Iglesias Feb 24 '16 at 23:26
  • Note that if you are using windowing, then maintaining a buffer in the DoFn's private state will mix data across windows, so whatever you do in finishBundle() will be operating across windows fairly arbitrarily (windows may be distributed across DoFn instances). – Kenn Knowles Mar 02 '16 at 05:04