
I am just getting started with Google Cloud Dataflow. I have written a simple pipeline that reads a CSV file from Cloud Storage. One of the steps calls a web service to enrich the results, and that web service performs much better when sent several hundred requests in bulk.

Looking at the API, I don't see a great way to aggregate 100 elements of a PCollection into a single ParDo execution. The results would then need to be split back out for the last step of the pipeline, which writes to a BigQuery table.

I'm not sure whether windowing is what I want. Most of the windowing examples I've seen are geared towards counting over a given time period.

Jeffrey Ellin

2 Answers


You can buffer elements in a local member variable of your DoFn, and call your web service when the buffer is large enough, as well as in finishBundle. For example:

import java.util.ArrayList;
import java.util.List;

class CallServiceFn extends DoFn<String, String> {
  private static final int MAX_CALL_SIZE = 100;
  private final List<String> elements = new ArrayList<>();

  @Override
  public void processElement(ProcessContext c) {
    elements.add(c.element());
    if (elements.size() >= MAX_CALL_SIZE) {
      for (String result : callServiceWithData(elements)) {
        c.output(result);
      }
      elements.clear();
    }
  }

  @Override
  public void finishBundle(Context c) {
    // Flush whatever is left over at the end of the bundle.
    for (String result : callServiceWithData(elements)) {
      c.output(result);
    }
    elements.clear();
  }
}
danielm
  • What's the best way to return data back to the pipeline? My service returns an ArrayList of results; ideally I would then like to chop that result set into individual elements. – Jeffrey Ellin May 11 '15 at 22:37
  • I've edited my post to show how to output results from the service call. – danielm May 11 '15 at 22:47
  • When running a batch application, does the DoFn.finishBundle() method run when a certain number of records is reached, or is its lifecycle tied to the entire data set? I assume you are using finishBundle to catch any records that are left over. – Jeffrey Ellin May 11 '15 at 23:28
  • finishBundle is called at the end of each bundle of elements. The bundles are of unspecified size but, in a batch pipeline, correspond roughly to one worker thread's share of the data. – danielm May 11 '15 at 23:53
  • I'm trying to replicate this code in Python but am not sure how to clear the list correctly. For example, in `process()`, `self.elements = []` works correctly, but `del self.elements[:]` leads to strange results. – sthomps Apr 11 '16 at 15:01
  • `del self.elements[:]` (or `elements.clear()` in the java example) could lead to strange results if `callServiceWithData` keeps a handle to `elements` beyond the lifetime of that call. – robertwb Jan 27 '17 at 05:37
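The aliasing hazard described in the last comment can be avoided by handing the service a copy of the buffer before clearing it. A minimal, runnable sketch of the idea (the `callServiceWithData` stub here is hypothetical; it simply retains a reference to its argument, as a misbehaving client might):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DefensiveCopyDemo {
    // Stand-in for the web service call; like some real clients,
    // it keeps a reference to the list it was given.
    static List<String> retained;

    static List<String> callServiceWithData(List<String> data) {
        retained = data;
        return data;
    }

    public static void main(String[] args) {
        List<String> elements = new ArrayList<>(Arrays.asList("a", "b"));

        // Hand the service a copy, then clear the original buffer.
        List<String> batch = new ArrayList<>(elements);
        callServiceWithData(batch);
        elements.clear();

        // The copy is untouched by the clear().
        System.out.println(retained); // prints [a, b]
    }
}
```

Clearing `elements` in place (`del self.elements[:]` in Python, `elements.clear()` in Java) mutates the very list the service may still hold; rebinding or copying leaves that reference intact.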

Note that a GroupIntoBatches transform was added to make this even easier.
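A sketch of how that might look in the Apache Beam Java SDK, assuming an input `PCollection<String>` named `lines` and the `callServiceWithData` helper from the accepted answer (`GroupIntoBatches` operates on key-value pairs, so a key must be attached first):

```java
PCollection<String> results = lines
    // GroupIntoBatches requires KV input, so assign a constant key.
    .apply(WithKeys.of(""))
    .apply(GroupIntoBatches.<String, String>ofSize(100))
    // Each element is now a KV<String, Iterable<String>> batch.
    .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (String result : callServiceWithData(c.element().getValue())) {
          c.output(result);
        }
      }
    }));
```

Note that using a single constant key serializes the batching through one key; in practice you may want to spread elements across several keys to preserve parallelism.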

robertwb