I am trying to run a batch Apache Beam job (through the TensorFlow Extended (TFX) library). The job should just read some CSV files from S3, convert them to TFRecord format (written back to S3), and gather statistics about the dataset. The pipeline runs fine with a very small dataset (a few MB), but when I try to run it on a bigger dataset (~400 MB), the job seems to get stuck (the record and byte counters in the Flink UI stop increasing), while I see repeated errors in the TaskManager log:
2021-08-06 12:53:06,019 WARN org.apache.flink.runtime.taskmanager.Task [] - Partition (1/4)#0 (b8941b4b27b50d9c1d275b3e9755471a) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for f63c7fd0c9674a5faabdb45c96ba2c91.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.establishJobManagerConnection(TaskExecutor.java:1602)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1600(TaskExecutor.java:181)
at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$0(TaskExecutor.java:2173)
at java.util.Optional.ifPresent(Optional.java:159)
at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerGainedLeadership$1(TaskExecutor.java:2171)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Found new job leader for job id f63c7fd0c9674a5faabdb45c96ba2c91.
... 28 more
The Flink cluster runs version 1.13.1 and is deployed in native Kubernetes mode on AWS EKS.
I have set the process memory for both the TaskManagers and the JobManager to 26 GB, so I would assume memory pressure is not the problem here.
Thanks, Gorjan