I have a streaming job whose initial run will have to process a large amount of data. One of the DoFns calls a remote service that supports batch requests, so when working with bounded collections I use the following approach:
import java.util.LinkedList;
import java.util.Queue;

import org.apache.beam.sdk.transforms.DoFn;

// DoFn already implements Serializable, so no explicit declaration is needed.
private static final class Function extends DoFn<String, Void> {

    private static final long serialVersionUID = 2417984990958377700L;
    private static final int LIMIT = 500;

    // Recreated in startBundle, so it never needs to be serialized.
    private transient Queue<String> buffered;

    @StartBundle
    public void startBundle(Context context) throws Exception {
        buffered = new LinkedList<>();
    }

    @ProcessElement
    public void processElement(ProcessContext context) throws Exception {
        buffered.add(context.element());
        if (buffered.size() >= LIMIT) { // flush as soon as the batch is full
            flush();
        }
    }

    @FinishBundle
    public void finishBundle(Context context) throws Exception {
        // Send whatever is still buffered at the end of the bundle.
        flush();
    }

    private void flush() {
        // Build one batch request from the buffered elements and send it.
        while (!buffered.isEmpty()) {
            buffered.poll();
            // add to the batch request here
        }
    }
}
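For reference, on a bounded collection I just apply the DoFn directly and rely on bundle boundaries. A minimal sketch (boundedSource and options are placeholders for my actual BoundedSource<String> and pipeline options):

Pipeline pipeline = Pipeline.create(options);
pipeline
    .apply("Read", Read.from(boundedSource))
    .apply("Process", ParDo.of(new Function()));
pipeline.run().waitUntilFinish();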
Is there a way to window data so the same approach can be used on unbounded collections?
I've tried the following:
pipeline
    .apply("Read", Read.from(source))
    .apply(WithTimestamps.of(input -> Instant.now()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2L))))
    .apply("Process", ParDo.of(new Function()));
but startBundle and finishBundle are called for every element. Is there a way to get RxJava-style buffering (2-minute windows or 100-element batches), i.e. something like:
source
    .toFlowable(BackpressureStrategy.LATEST)
    .buffer(2, TimeUnit.MINUTES, 100)
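To make the comparison concrete, here is the complete RxJava sketch of the behavior I'm after (PublishSubject and sendBatch are illustrative, not part of my actual code):

import io.reactivex.BackpressureStrategy;
import io.reactivex.subjects.PublishSubject;
import java.util.List;
import java.util.concurrent.TimeUnit;

PublishSubject<String> source = PublishSubject.create();
source
    .toFlowable(BackpressureStrategy.LATEST)
    // emits a batch every 2 minutes or as soon as 100 elements arrive, whichever comes first
    .buffer(2, TimeUnit.MINUTES, 100)
    .subscribe((List<String> batch) -> sendBatch(batch)); // sendBatch: hypothetical batch-request helper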