
I am trying to run a batch Apache Beam job (through the TensorFlow Extended - TFX library). The job should simply read some CSV files from S3, convert them to the TFRecord format (writing the result back to S3), and gather statistics about the dataset. The pipeline runs fine with a very small dataset (a few MB), but when I try to run it on a bigger dataset (~400 MB), the job seems to get stuck (the record/byte count metrics in the Flink UI stop increasing), while I see this error repeated in the TaskManager log:

2021-08-06 12:53:06,019 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - Partition (1/4)#0 (b8941b4b27b50d9c1d275b3e9755471a) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for f63c7fd0c9674a5faabdb45c96ba2c91.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.establishJobManagerConnection(TaskExecutor.java:1602)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1600(TaskExecutor.java:181)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$0(TaskExecutor.java:2173)
    at java.util.Optional.ifPresent(Optional.java:159)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerGainedLeadership$1(TaskExecutor.java:2171)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Found new job leader for job id f63c7fd0c9674a5faabdb45c96ba2c91.
    ... 28 more
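
For reference, the pipeline itself is built roughly like the sketch below (simplified; the bucket paths, pipeline name, and Flink master address are placeholders rather than my actual configuration):

```python
# Simplified sketch of the TFX pipeline described above; paths, names and
# addresses are placeholders, not the real setup.
from tfx import v1 as tfx

def create_pipeline():
    # Read the CSV files and materialize them as TFRecords (tf.Example).
    example_gen = tfx.components.CsvExampleGen(input_base="s3://my-bucket/csv-data/")
    # Compute dataset statistics over the generated examples.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])

    return tfx.dsl.Pipeline(
        pipeline_name="csv-to-tfrecords",
        pipeline_root="s3://my-bucket/pipeline-root/",
        components=[example_gen, statistics_gen],
        beam_pipeline_args=[
            "--runner=FlinkRunner",
            "--flink_master=flink-jobmanager:8081",  # placeholder address
        ],
    )

tfx.orchestration.LocalDagRunner().run(create_pipeline())
```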

The Flink cluster is version 1.13.1, deployed as a native Kubernetes deployment on an AWS EKS cluster.

I have set the process memory for both the task managers and job manager to 26 GB, so I would assume there is no memory pressure here.
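
Concretely, the memory settings amount to the following Flink configuration (a sketch in flink-conf.yaml form; in a native Kubernetes deployment they may instead be passed as `-D` dynamic properties):

```yaml
# Total process memory for the JobManager and each TaskManager, as described above.
jobmanager.memory.process.size: 26g
taskmanager.memory.process.size: 26g
```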

Thanks, Gorjan

    Just a guess, but I suspect that some threads are busy enough that something is timing out, and then the restart isn't succeeding. I suggest searching the logs for evidence of timeouts, and reviewing the timeouts in https://ci.apache.org/projects/flink/flink-docs-stable/docs/deployment/config/ to see which might be worth increasing. Perhaps `akka.ask.timeout`, `akka.lookup.timeout`, `akka.tcp.timeout`, and/or `heartbeat.timeout`? – David Anderson Aug 06 '21 at 20:34
  • @DavidAnderson, you are probably right. I found this repeated in the logs, so I will check all the timeout settings:
    ```
    2021-08-06 20:47:22,487 ERROR /usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py:291 [] - Error processing instruction 32. Original traceback is Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 459, in input_elements
        element = received.get(timeout=1)
      File "/usr/local/lib/python3.7/queue.py", line 178, in get
        raise Empty
    _queue.Empty
    ```
    – Gorjan Todorovski Aug 07 '21 at 05:23
  • @GorjanTodorovski Did you end up finding a solution to this? I'm stuck on exactly the same problem. Thanks! – Ayman Farhat Nov 19 '22 at 14:56
  • It was some time ago, but as I remember, the problem ended up being a wrong configuration on my side when running with multiple task managers. The correct configuration, which solved the issue, is to set this parameter when launching the job (sketched in context below): `--environment_config=localhost:50000` – Gorjan Todorovski Nov 27 '22 at 18:28
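
To spell out the fix from the last comment: with `--environment_type=EXTERNAL`, Beam's `--environment_config` must point at the Beam Python SDK worker-pool service running alongside each TaskManager. A sketch of the corresponding pipeline options (the Flink master address is a placeholder, and 50000 is only the worker pool's usual port):

```python
# Beam pipeline options reflecting the fix from the comments above; the Flink
# master address is a placeholder and 50000 is the typical worker-pool port.
beam_pipeline_args = [
    "--runner=FlinkRunner",
    "--flink_master=flink-jobmanager:8081",
    "--environment_type=EXTERNAL",
    # Address of the Beam SDK worker-pool sidecar, as seen from the TaskManager pod.
    "--environment_config=localhost:50000",
]
```

The timeout keys suggested in the first comment (`heartbeat.timeout`, `akka.ask.timeout`, etc.) are separate Flink configuration options that can also be raised, but per the last comment it was the environment configuration that actually resolved the issue.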

0 Answers