
Running my streaming Dataflow job for a longer period of time tends to end in a "GC overhead limit exceeded" error that brings the job to a halt. How can I best proceed to debug this?

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.cloud.dataflow.worker.repackaged.com.google.common.collect.HashBasedTable.create (HashBasedTable.java:76)
    at com.google.cloud.dataflow.worker.WindmillTimerInternals.<init> (WindmillTimerInternals.java:53)
    at com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.start (StreamingModeExecutionContext.java:490)
    at com.google.cloud.dataflow.worker.StreamingModeExecutionContext.start (StreamingModeExecutionContext.java:221)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker.process (StreamingDataflowWorker.java:1058)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000 (StreamingDataflowWorker.java:133)
    at com.google.cloud.dataflow.worker.StreamingDataflowWorker$8.run (StreamingDataflowWorker.java:841)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
    at java.lang.Thread.run (Thread.java:745)
  • Job ID: 2018-02-06_00_54_50-15974506330123401176
  • SDK: Apache Beam SDK for Java 2.2.0
  • Scio version: 0.4.7
  • See https://stackoverflow.com/questions/34294249/memory-profiling-on-google-cloud-dataflow/34296535#34296535 – jkff Feb 22 '18 at 17:00

1 Answer


I've run into this issue a few times. My approach typically starts with trying to isolate the transform step that is causing the memory error in Dataflow. It's a slow process, but you can usually make an educated guess about which transform is the culprit. Remove that transform, execute the pipeline, and check whether the error persists (see the sketch below).
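A minimal Beam Java sketch of that bisection approach; the transform and step names here are hypothetical stand-ins, not anything from your actual job:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BisectPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Stand-in for the real streaming source.
        PCollection<String> input = p.apply("ReadInput", Create.of("a", "b", "c"));

        // Suspect step: comment this apply() out, re-run the job, and see
        // whether the "GC overhead limit exceeded" error still shows up.
        PCollection<String> suspect = input.apply("SuspectTransform",
            MapElements.into(TypeDescriptors.strings()).via(s -> s.toUpperCase()));

        // Stand-in for the real sink / downstream steps.
        suspect.apply("Downstream",
            MapElements.into(TypeDescriptors.strings()).via(s -> s.trim()));

        p.run();
      }
    }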

Once I've identified the problematic transform, I look at its implementation for memory inefficiencies. These are usually related to object initialization (per-element memory allocation) or to a design where a transform has a very high fan-out, producing a large amount of output. But it can be something as trivial as string manipulation. An example of the allocation case is sketched below.
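As an illustration of the object-initialization case, here is a sketch of a hypothetical DoFn that allocates a large scratch buffer for every element, together with the usual fix of allocating it once per instance in @Setup (the class name and buffer size are made up for the example):

    import org.apache.beam.sdk.transforms.DoFn;

    // Hypothetical example of a per-element allocation hotspot and its fix.
    public class ParseEventsFn extends DoFn<String, String> {

      // Problematic pattern (shown commented out): a fresh multi-megabyte
      // buffer allocated for every element keeps the garbage collector busy.
      //
      //   @ProcessElement
      //   public void process(ProcessContext c) {
      //     byte[] scratch = new byte[8 * 1024 * 1024];
      //     ...
      //   }

      private transient byte[] scratch;

      @Setup
      public void setup() {
        // Allocate once per DoFn instance instead of once per element.
        scratch = new byte[8 * 1024 * 1024];
      }

      @ProcessElement
      public void process(ProcessContext c) {
        // ... parse c.element() using the reusable scratch buffer ...
        c.output(c.element());
      }
    }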

From here, it's a matter of continuing to narrow down the issue. Dataflow does have memory limitations. You could increase the size of the Compute Engine instances backing the workers (see the flags below), but that isn't a scalable solution.
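If you do want to try bigger workers, one way (assuming the Dataflow runner; the machine type below is only an example) is to pass a larger machine type when launching the job:

    --runner=DataflowRunner \
    --workerMachineType=n1-highmem-8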

You should also consider implementing the pipeline using only the Apache Beam Java SDK. This will rule out Scio as the cause, although Scio usually isn't the problem.
