0

I'm working with Kinesis Application using Flink 1.11, but I got the following error when start my application:

java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 421563c271e57acb4592f9d447d45b42  timed out.
    at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1202)
    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I'm using the default configuration in Kinesis and 4 KPUs.

2 Answers2

0

Judging from the error report, the heartbeat of TM and JM has timed out, maybe you can give the complete configuration information of the Job.

There are many reasons for the timeout between JM and TM, such as large concurrent tasks, high JM load, etc.

Maybe you can also adjust the heartbeat.timeout parameter, refer to https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/

ChangLi
  • 772
  • 2
  • 8
  • Unfortunately I'm using the AWS Kinesis Application implementation and based on the AWS documentation Kinesis uses the default configuration described in Apache Flink documentation. I also read that most of the configuration parameters are not modifiable. – Felipe Jorquera Uribe Dec 29 '21 at 21:09
  • If you can't adjust the timeout, giving the cluster more resources (e.g., increasing the parallelism) should make timeouts less likely. – David Anderson Dec 30 '21 at 11:11
0

Finally I solved this issue. In my case the problem was that I was reading a CSVSource too large for Flink workers. I solved this using dynamodb as data source.

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 13 '22 at 00:04