
Running my streaming Dataflow job for a longer period of time tends to end in a "GC overhead limit exceeded" error that brings the job to a halt. How can I best proceed to debug this?

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.cloud.dataflow.worker.repackaged.com.google.common.collect.HashBasedTable.create (HashBasedTable.java:76)
    at com.google.cloud.dataflow.worker.WindmillTimerInternals.<init> (WindmillTimerInternals.java:53)
    at com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.start (StreamingModeExecutionContext.java:490)
    at com.google.cloud.dataflow.worker.StreamingModeExecutionContext.start (StreamingModeExecutionContext.java:221)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker.process (StreamingDataflowWorker.java:1058)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000 (StreamingDataflowWorker.java:133)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run (StreamingDataflowWorker.java:841)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
    at java.lang.Thread.run (Thread.java:745)
  • Job ID: 2018-02-06_00_54_50-15974506330123401176
  • SDK: Apache Beam SDK for Java 2.2.0
  • Scio version: 0.4.7
  • See https://stackoverflow.com/questions/34294249/memory-profiling-on-google-cloud-dataflow/34296535#34296535 – jkff Feb 22 '18 at 17:00

1 Answer


I've run into this issue a few times. My approach typically starts with trying to isolate the transform step that is causing the memory error in Dataflow. It's a slow process, but you can usually make an educated guess about which transform is the culprit. Remove that transform, execute the pipeline, and check whether the error persists (see the sketch below).
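A minimal Beam Java sketch of that bisection approach; the transform and step names here are hypothetical stand-ins, not anything from your actual job:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BisectPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Stand-in for the real streaming source.
        PCollection<String> input = p.apply("ReadInput", Create.of("a", "b", "c"));

        // Suspect step: comment this apply() out, re-run the job, and see
        // whether the "GC overhead limit exceeded" error still shows up.
        PCollection<String> suspect = input.apply("SuspectTransform",
            MapElements.into(TypeDescriptors.strings()).via(s -> s.toUpperCase()));

        // Stand-in for the real sink / downstream steps.
        suspect.apply("Downstream",
            MapElements.into(TypeDescriptors.strings()).via(s -> s.trim()));

        p.run();
      }
    }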

Once I've identified the problematic transform, I look at its implementation for memory inefficiencies. These are usually related to object initialization (per-element memory allocation) or to a design where a transform has a very high fan-out, producing a large amount of output. But it can be something as trivial as string manipulation. An example of the allocation case is sketched below.
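As an illustration of the object-initialization case, here is a sketch of a hypothetical DoFn that allocates a large scratch buffer for every element, together with the usual fix of allocating it once per instance in @Setup (the class name and buffer size are made up for the example):

    import org.apache.beam.sdk.transforms.DoFn;

    // Hypothetical example of a per-element allocation hotspot and its fix.
    public class ParseEventsFn extends DoFn<String, String> {

      // Problematic pattern (shown commented out): a fresh multi-megabyte
      // buffer allocated for every element keeps the garbage collector busy.
      //
      //   @ProcessElement
      //   public void process(ProcessContext c) {
      //     byte[] scratch = new byte[8 * 1024 * 1024];
      //     ...
      //   }

      private transient byte[] scratch;

      @Setup
      public void setup() {
        // Allocate once per DoFn instance instead of once per element.
        scratch = new byte[8 * 1024 * 1024];
      }

      @ProcessElement
      public void process(ProcessContext c) {
        // ... parse c.element() using the reusable scratch buffer ...
        c.output(c.element());
      }
    }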

From here, it's a matter of continuing to narrow down the issue. Dataflow does have memory limitations. You could increase the size of the Compute Engine instances backing the workers (see the flags below), but that isn't a scalable solution.
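If you do want to try bigger workers, one way (assuming the Dataflow runner; the machine type below is only an example) is to pass a larger machine type when launching the job:

    --runner=DataflowRunner \
    --workerMachineType=n1-highmem-8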

You should also consider implementing the pipeline using only the Apache Beam Java SDK. This will rule out Scio as the cause, although Scio usually isn't the problem.
