I have three GroupBys in my Java Dataflow pipeline, and after each GroupBy the program runs some computation between stages. The groups become larger and larger blocks; the only thing each stage adds is a new key to each block.
The last GroupBy deals with a smaller number of large blocks. The pipeline works fine for a small number of items, of course, but it fails at the second or third GroupBy for a large number of items.
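For concreteness, here is a minimal sketch of the pipeline shape (Dataflow Java SDK 1.x style, matching the UPDATE below; Rekey1, buildStages, and the keying logic are hypothetical placeholders, not my real computation, and SortBins stands for the "Sort Bins" DoFn from the UPDATE wrapped in a named class):

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Hypothetical first re-key: per-block computation, then a new key for each block.
static class Rekey1
    extends DoFn<KV<Long, Iterable<TableRow>>, KV<Integer, KV<Long, Iterable<TableRow>>>> {
  @Override
  public void processElement(ProcessContext c) {
    KV<Long, Iterable<TableRow>> block = c.element();
    Integer newKey = (int) (block.getKey() / 1000);  // placeholder keying
    c.output(KV.of(newKey, block));
  }
}

// The three GroupBys; each stage's blocks are bigger than the previous stage's.
static void buildStages(PCollection<KV<Long, TableRow>> rows) {
  rows
      // GroupBy #1: many small blocks.
      .apply(GroupByKey.<Long, TableRow>create())
      // Computation plus a new key per block.
      .apply(ParDo.named("Rekey 1").of(new Rekey1()))
      // GroupBy #2: fewer, larger blocks.
      .apply(GroupByKey.<Integer, KV<Long, Iterable<TableRow>>>create())
      // The "Sort Bins" DoFn shown in the UPDATE below.
      .apply(ParDo.named("Sort Bins").of(new SortBins()))
      // GroupBy #3: a small number of very large blocks; large runs fail around here.
      .apply(GroupByKey
          .<Integer, KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>>create());
}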
I played with Xms and Xmx and even chose much larger worker instances ('n1-standard-64'), but it didn't work. For the failing example, I'm sure the output is smaller than 5 GB, so is there any other way to control memory in Dataflow per map/reduce task?
If Dataflow can handle the first GroupBy, then it should be able to reduce the number of tasks to allocate more heap memory and handle the large blocks in the next stage.
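For reference, the only memory-related knobs I have found are on the pipeline options (a sketch, assuming the SDK 1.x DataflowPipelineOptions; the machine type is the one I tried, the worker count is illustrative):

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public static void main(String[] args) {
  DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
      .withValidation()
      .as(DataflowPipelineOptions.class);
  // Bigger workers mean more RAM per worker, but this did not prevent the failure.
  options.setWorkerMachineType("n1-standard-64");
  // Fewer workers means fewer concurrent tasks overall, but this cannot be
  // tuned per stage -- which is essentially what I am asking for.
  options.setNumWorkers(8);  // illustrative value
  // ... build and run the pipeline with these options ...
}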
Any suggestions would be appreciated!
UPDATE:
.apply(ParDo.named("Sort Bins").of(
    new DoFn<KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>,
             KV<Integer, KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>>>() {
      @Override
      public void processElement(ProcessContext c) {
        KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>> e = c.element();
        Integer secondaryKey = e.getKey();
        // Copy the grouped values into a modifiable list so they can be sorted.
        ArrayList<KV<Long, Iterable<TableRow>>> records = Lists.newArrayList(e.getValue());
        Collections.sort(records, BinID_COMPARATOR);
        // computePrimaryKey is a placeholder: the primary key is a simple
        // function of the secondary key.
        Integer primaryKey = computePrimaryKey(secondaryKey);
        c.output(KV.of(primaryKey,
            KV.of(secondaryKey, (Iterable<KV<Long, Iterable<TableRow>>>) records)));
      }
    }));
The error is reported for the last line, the c.output(...) call.