3

It's similar to this question Albeit I'm running the executable jar locally (with java -jar command). I use 180G for the jvm heap max size and the exception was thrown at about 110G of usage after about 37 minutes

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id someId timed out.
  • In the question above it is mentioned "maybe the membory of JobManager is too small". How is this translated to local execution? Is it just memory that I configure with Xmx like I wrote above (180G)?

  • I read somewhere else that I can configure the heartbeat interval of the taskmanager? How do I do that in local execution?

  • (EDIT) Where can I find the logs of the task manager in local execution?

I'm using flink1.9 by the way

EDITED logs:

18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.runtime.jobmaster.JobMaster - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:28:33.993 [flink-akka.actor.default-dispatcher-530] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 966e2f0da4aa05748d6c7921fb02819e.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 966e2f0da4aa05748d6c7921fb02819e.
18:28:33.993 [flink-akka.actor.default-dispatcher-533] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:28:33.993 [flink-akka.actor.default-dispatcher-530] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:28:33.993 [flink-akka.actor.default-dispatcher-532] DEBUG o.a.f.r.r.s.SlotManagerImpl - Received slot report from instance 899db08270ab48abff5e3111979c6f07: SlotReport{slotsStatus=[SlotStatus{slotID=0895b99a-27b3-43da-8226-16031db924de_0, resourceProfile=ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=165639}, allocationID=8b473e13eadfa5d9b8e34e9020536b5e, jobID=6109483ea411f1c099d0bb18a5995742}]}.
18:29:44.070 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Trigger heartbeat request.
18:29:44.070 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Trigger heartbeat request.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 966e2f0da4aa05748d6c7921fb02819e.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO  o.a.f.r.r.StandaloneResourceManager - The heartbeat of JobManager with id 966e2f0da4aa05748d6c7921fb02819e timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.r.j.slotpool.SlotPoolImpl - Slot Pool Status:
    status: connected to akka://flink/user/resourcemanager
    registered TaskManagers: [0895b99a-27b3-43da-8226-16031db924de]
    available slots: []
    allocated slots: [[AllocatedSlot 8b473e13eadfa5d9b8e34e9020536b5e @ 0895b99a-27b3-43da-8226-16031db924de @ localhost (dataPort=-1) - 0]]
    pending requests: []
    }

18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO  o.a.f.r.r.StandaloneResourceManager - Disconnect job manager afeaa9d79956984a8a830ef4007a4315@akka://flink/user/jobmanager_1 for job 6109483ea411f1c099d0bb18a5995742 from the resource manager.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO  o.a.f.r.r.StandaloneResourceManager - The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Received heartbeat request from 2ac3bc91e5eb2b7e41baf47d97764652.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] INFO  o.a.f.r.r.StandaloneResourceManager - Closing TaskExecutor connection 0895b99a-27b3-43da-8226-16031db924de because: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de  timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.s.SlotManagerImpl - Unregister TaskManager 899db08270ab48abff5e3111979c6f07 from the SlotManager.
18:29:44.071 [flink-akka.actor.default-dispatcher-534] DEBUG o.a.f.runtime.jobmaster.JobMaster - Disconnect TaskExecutor 0895b99a-27b3-43da-8226-16031db924de because: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 966e2f0da4aa05748d6c7921fb02819e.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received heartbeat from 0895b99a-27b3-43da-8226-16031db924de.
18:29:44.071 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - Received slot report from TaskManager 0895b99a-27b3-43da-8226-16031db924de which is no longer registered.
18:29:44.071 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.taskexecutor.TaskExecutor - Close ResourceManager connection 2ac3bc91e5eb2b7e41baf47d97764652.
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de  timed out.
    at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1144)
    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
18:29:44.072 [flink-akka.actor.default-dispatcher-534] INFO  o.a.f.r.e.ExecutionGraph - Map (1/1) (f6b46e7b79dc5c196c14072ff765354f) switched from RUNNING to FAILED.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
18:29:44.072 [flink-akka.actor.default-dispatcher-535] DEBUG o.a.f.r.r.StandaloneResourceManager - No open TaskExecutor connection 0895b99a-27b3-43da-8226-16031db924de. Ignoring close TaskExecutor connection. Closing reason was: The heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de  timed out.
18:29:44.072 [flink-akka.actor.default-dispatcher-536] INFO  o.a.f.r.taskexecutor.TaskExecutor - Connecting to ResourceManager akka://flink/user/resourcemanager(bcfecd2d9083fd495b6f80b0a30e4507).
18:29:44.072 [flink-akka.actor.default-dispatcher-536] DEBUG o.a.f.r.rpc.akka.AkkaRpcService - Try to connect to remote RPC endpoint with address akka://flink/user/resourcemanager. Returning a org.apache.flink.runtime.resourcemanager.ResourceManagerGateway gateway.
18:29:44.072 [flink-akka.actor.default-dispatcher-534] INFO  o.a.f.r.e.ExecutionGraph - Job Flink Streaming Job (6109483ea411f1c099d0bb18a5995742) switched from state RUNNING to FAILING.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 0895b99a-27b3-43da-8226-16031db924de timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Manos Ntoulias
  • 513
  • 1
  • 4
  • 21
  • 1
    This is the log from `JobManager`, this doesn't exactly tell us if anything went wrong. Please paste the logs from `TaskManager` – Dominik Wosiński Jun 04 '20 at 13:02
  • Where would that be in local execution? Is it printed on the terminal just like the logs of TaskManager? – Manos Ntoulias Jun 04 '20 at 13:14
  • By local execution do You mean running it without cluster ?:) – Dominik Wosiński Jun 04 '20 at 13:35
  • Yes. Not even standalone. I just run it with "java" in the terminal or by pressing run on the IDE – Manos Ntoulias Jun 04 '20 at 13:37
  • It can be multiple reasons why it is hapenning, can be maybe a TM`s GC that take too long, or network issues (in your case it's not rellevant because you run locally). I guess you should make debug level logs on the TM and then provide us some logs. – ShemTov Jun 06 '20 at 07:38
  • @ShemTov yes, but where are those TM logs when running locally? Are they printed alongis the JM's in the terminal? – Manos Ntoulias Jun 06 '20 at 14:03
  • did you find solution? – Atheer Abdullatif Jan 13 '21 at 14:58
  • 1
    @AtheerAbdullatif AFAIK I had a leak in memory. For the original version it would stop at 180GB (i.e, the limit). But when using flink not only it requires more memory for the same task but the resources are split among the nodes. I assume the task manager had about 110GB of the total 180GB. There are flink docs about how memory is set when using a single java process but I never got to fully apprehend it – Manos Ntoulias Apr 27 '21 at 10:25
  • set java variable of `log.file` to config the taskmanager.log location – geosmart Aug 26 '21 at 04:48
  • @ManosNtoulias Were you able to find a solution? – Prashant Chaudhary Sep 27 '21 at 09:47
  • 1
    @PrashantChaudhary It was a program specific memory leak in my code which caused the task manager to run out of memory. I fixed the leak and it's running fine now. – Manos Ntoulias Sep 27 '21 at 10:16
  • @ManosNtoulias How did you find out what piece of the code caused the memory leak? I have a similar problem with python beam pipelines and flink? Thanks – Hej Ja Jan 24 '23 at 13:16

0 Answers0