
I have three GroupBys in my Java pipeline, and after each GroupBy the program runs some computation between stages. These groups become larger and larger blocks; the only thing the program adds is a new key to each block.

The last GroupBy deals with a small number of large blocks. Of course, the pipeline works for a small number of items, but it fails at the second or third GroupBy for a large number of items.

I played with Xms and Xmx and even chose much larger instances ('n1-standard-64'), but it didn't work. For the failing example, I'm sure the output is smaller than 5 GB. Is there any other way I can control memory in Dataflow per map/reduce task?

If Dataflow can handle the first GroupBy, then it should be able to reduce the number of tasks so that more heap memory is available for the large blocks in the next stage.
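For reference, Dataflow sizes each worker's heap from its machine type rather than from Xms/Xmx, so the available knobs are the machine type and the number of concurrent work items per worker. Below is a minimal sketch of setting those options, assuming the Beam Dataflow runner (the 1.x SDK has equivalents under com.google.cloud.dataflow.sdk.options; the machine type and thread count are illustrative values, not a recommendation):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class HighMemoryPipeline {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);

        // There is no per-task Xmx on Dataflow: each worker's heap is sized
        // from its machine type, so a high-memory type buys more heap per worker.
        options.setWorkerMachineType("n1-highmem-16");  // illustrative choice

        // Fewer harness threads means fewer elements in flight per worker,
        // leaving more heap for each large group.
        options.as(DataflowPipelineDebugOptions.class).setNumberOfWorkerHarnessThreads(1);

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here, then p.run() ...
      }
    }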

Any suggestion will be appreciated!

UPDATE:

    .apply(ParDo.named("Sort Bins").of(
        new DoFn<KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>,
                 KV<Integer, KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>>>() {

          @Override
          public void processElement(ProcessContext c) {
            KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>> e = c.element();
            Integer secondaryKey = e.getKey();

            // Copy the iterable into a modifiable list so it can be sorted in memory.
            ArrayList<KV<Long, Iterable<TableRow>>> records = Lists.newArrayList(e.getValue());
            Collections.sort(records, BinID_COMPARATOR);

            // The primary key is a simple function of the secondary key;
            // computePrimaryKey stands in for that function here.
            Integer primaryKey = computePrimaryKey(secondaryKey);
            c.output(KV.of(primaryKey,
                KV.of(secondaryKey, (Iterable<KV<Long, Iterable<TableRow>>>) records)));
          }
        }));

The error is reported for the last line (the `c.output` call).
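To find the key that blows up, the advice in the comments below is to log the size of each group inside processElement. Here is a minimal sketch of such logging, assuming the Dataflow 1.x SDK used in the snippet above (GroupSizeLogger and its method name are illustrative, not part of any SDK):

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.common.collect.Iterables;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /** Illustrative helper that logs how big one group is, so hot keys stand out. */
    class GroupSizeLogger {
      private static final Logger LOG = LoggerFactory.getLogger(GroupSizeLogger.class);

      static void logGroupSize(Integer key, Iterable<KV<Long, Iterable<TableRow>>> bins) {
        long binCount = 0;
        long rowCount = 0;
        for (KV<Long, Iterable<TableRow>> bin : bins) {
          binCount++;
          rowCount += Iterables.size(bin.getValue());  // walks the iterable once
        }
        LOG.info("key={} bins={} rows={}", key, binCount, rowCount);
      }
    }

Calling GroupSizeLogger.logGroupSize(secondaryKey, e.getValue()) at the top of processElement would show which keys carry disproportionate amounts of data.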

  • Please include a job ID and a more detailed description of what your pipeline is doing (ideally, if possible, a code snippet of the step that is failing). GroupBy on its own typically can't cause OOMs - memory is used by the code that does something with each key/values group. Seeing the code would help make suggestions on improving its memory usage. – jkff Oct 12 '17 at 19:01
  • Thanks Eugene. I updated the post. Here is the job id: 2017-10-12_12_04_18-11267371215481669185 – AmirCS Oct 12 '17 at 19:24
  • OK, I see. Have you considered logging the amount of data in your ProcessElement calls? (number of elements at each level of your iterables) - note also that something that is reported by Dataflow as 5G in encoded form may take much more in Java memory. – jkff Oct 12 '17 at 20:06
  • I have lots of free memory per task (according to the logs). Let's suppose it's more than 5G; what would you do to handle this problem dynamically? – AmirCS Oct 12 '17 at 21:02
  • Maybe try using the SortValues transform, which uses external-memory sorting to sort things that don't fit in main memory? It seems like it does exactly what your "Sort Bins" does. It also seems like you're using an old SDK version; I'm not sure that transform is present there. – jkff Oct 12 '17 at 21:26
  • I also looked at your job's execution, and it is definitely running out of memory because of a large amount of data on a single key. I can't see which key it is, but logging should help you diagnose that. I really recommend adding more logging - just looking at the amount of logged free memory may not tell you enough; memory usage can spike and OOM suddenly without being caught by that periodic logging. – jkff Oct 12 '17 at 21:32
  • Oh, one more thing: the TableRow type is extremely inefficient (it is JSON, basically). Try converting your data to something more specific and less memory-hungry (e.g. some custom type) immediately after ingesting it from BigQuery. – jkff Oct 12 '17 at 21:34
  • Thanks so much Eugene! I upgraded the SDK and also changed TableRow. It worked. The next step is to use "SortValues" for larger files. The only issue I see right now is that transferring records from BigQuery to Dataflow is much slower using the new SDK. For BigQueryIO.Read, it says "Part Running" but no elements are added. After 3 to 4 mins, it starts loading elements. The version of BigQuery I'm using is: v2-rev354-1.22.0 --- Here is the job ID: "2017-10-13_12_09_43-11052447607999095915" Thanks! – AmirCS Oct 13 '17 at 19:23
  • BTW, any suggestions for [this question](https://stackoverflow.com/questions/46715860/dataflow-groupby-multiple-outputs-based-on-keys) . Thanks! – AmirCS Oct 13 '17 at 19:25
  • About bigquery performance: due to various optimizations, performance comparisons make sense only between complete end to end runs; do you have an earlier run of essentially the same job on similar data that is much faster? – jkff Oct 14 '17 at 00:45
  • That's right, Eugene. I didn't have numbers for the old pipeline, but I tested another pipeline and the results are almost the same for the old SDK and the new one. Thanks! – AmirCS Oct 16 '17 at 19:00
  • Hi @jkff, which version of Apache Beam supports Sorter? I cannot find it in 2.3.0. `https://github.com/apache/beam/tree/master/sdks/java/extensions/sorter/src/main/java/org/apache/beam/sdk/extensions/sorter` – AmirCS Feb 20 '18 at 07:58
  • Also, would you please post a link to an example of using this library? Thanks a lot – AmirCS Feb 20 '18 at 08:06
  • Sorter exists in 2.3.0, here's the artifact https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-extensions-sorter/2.3.0 . Have you looked at its unit tests for examples? – jkff Feb 20 '18 at 15:17 (a minimal usage sketch follows this thread)
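Following up on the SortValues discussion above: here is a minimal sketch of replacing the in-memory "Sort Bins" step with the sorter extension, assuming Beam 2.3.0 with the beam-sdks-java-extensions-sorter artifact on the classpath (the type parameter R stands in for whatever custom type replaced TableRow):

    import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
    import org.apache.beam.sdk.extensions.sorter.SortValues;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    class SortBins {
      // Sorts each group's (bin ID, record) pairs by bin ID using an
      // external-memory sort that spills to local disk instead of OOMing.
      // SortValues orders values by the encoded bytes of the secondary key,
      // so with a big-endian long coder, non-negative bin IDs come out in
      // numeric order.
      static <R> PCollection<KV<Integer, Iterable<KV<Long, R>>>> sortBins(
          PCollection<KV<Integer, Iterable<KV<Long, R>>>> grouped) {
        return grouped.apply(
            "Sort Bins",
            SortValues.<Integer, Long, R>create(BufferedExternalSorter.options()));
      }
    }

This removes the need to materialize and sort each group in a single worker's heap; the in-memory buffer used before spilling is tunable through the sorter's options (see the extension's javadoc).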

0 Answers