
I created an HA Flink v1.2 cluster made up of 1 JobManager and 2 TaskManagers, each in its own VM (not using YARN or HDFS). After I start a job on the JobManager node I kill one TaskManager instance. Immediately in the Web Dashboard I can see the job being cancelled and then failing. If I check the logs:

03/06/2017 16:23:50 Flat Map(1/2) switched to DEPLOYING 
03/06/2017 16:23:50 Flat Map(2/2) switched to SCHEDULED 
03/06/2017 16:23:50 Flat Map(2/2) switched to DEPLOYING 
03/06/2017 16:23:50 Flat Map(1/2) switched to RUNNING 
03/06/2017 16:23:50 Source: Custom Source -> Flat Map(1/2) switched to RUNNING 
03/06/2017 16:23:50 Flat Map(2/2) switched to RUNNING 
03/06/2017 16:23:50 Source: Custom Source -> Flat Map(2/2) switched to RUNNING 
03/06/2017 16:25:38 Flat Map(1/2) switched to FAILED 
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-106-0-238/10.106.0.238:40578'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:118)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)

03/06/2017 16:25:38 Job execution switched to status FAILING.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-106-0-238/10.106.0.238:40578'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:118)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:294)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:223)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:829)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:610)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
03/06/2017 16:25:38 Source: Custom Source -> Flat Map(1/2) switched to CANCELING 
03/06/2017 16:25:38 Source: Custom Source -> Flat Map(2/2) switched to CANCELING 
03/06/2017 16:25:38 Flat Map(2/2) switched to CANCELING 
03/06/2017 16:25:38 Source: Custom Source -> Flat Map(1/2) switched to CANCELED 
03/06/2017 16:26:18 Source: Custom Source -> Flat Map(2/2) switched to CANCELED 
03/06/2017 16:26:18 Flat Map(2/2) switched to CANCELED 

In the job implementation I have

env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        3,                            // number of restart attempts
        Time.of(10, TimeUnit.SECONDS) // delay between attempts
));

My question is: shouldn't the JobManager automatically redirect all work to the remaining, running TaskManager? Similarly, if I start the JobManager and 1 TaskManager instance and run a job, shouldn't the 2nd TaskManager instance, once started, also contribute to the running job?

Thanks!

razvan

1 Answer


First of all, the RestartStrategy has nothing to do with HA mode. High availability concerns the availability of the JobManager, and for HA to work at least two JobManager instances are required (you said you are starting just one).
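
For reference, a standalone HA setup with two JobManagers is configured through flink-conf.yaml plus the conf/masters file, roughly as in the sketch below (based on the Flink 1.2 HA documentation; the ZooKeeper quorum, hostnames and storage directory are placeholders, and the exact storageDir key should be double-checked against your Flink version):

    # flink-conf.yaml (sketch)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181
    # recovery metadata must live on storage reachable by all JobManagers (NFS, HDFS, S3, ...)
    high-availability.zookeeper.storageDir: file:///mnt/shared/flink/recovery

    # conf/masters (sketch) -- one entry per JobManager started by start-cluster.sh
    jobmanager-host-1:8081
    jobmanager-host-2:8081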

As for the RestartStrategy: when you specify the fixedDelayRestart strategy, after a failure (as in your case, when you kill a TaskManager) the job is tried again (in your case after 10 seconds). If that is not happening in your installation, you are probably missing the resources needed to run the job: I suppose you have 1 task slot per TaskManager, so when just one TaskManager is left you cannot run a job with parallelism 2 or more.
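
To illustrate the resource point, here is a minimal, self-contained sketch (Flink 1.2 DataStream API; the socket source and word splitting are chosen purely for illustration) in which the parallelism is kept low enough that a restarted job still fits on the slots of a single surviving TaskManager:

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class RestartStrategySketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // With 2 TaskManagers x 1 slot each, a job with parallelism 2 cannot be
            // rescheduled once one TaskManager is lost; parallelism 1 still fits on
            // the surviving TaskManager.
            env.setParallelism(1);

            // Retry the whole job 3 times, waiting 10 seconds between attempts.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    3, Time.of(10, TimeUnit.SECONDS)));

            env.socketTextStream("localhost", 9999)
                    .flatMap(new FlatMapFunction<String, String>() {
                        @Override
                        public void flatMap(String line, Collector<String> out) {
                            for (String word : line.split("\\s+")) {
                                out.collect(word);
                            }
                        }
                    })
                    .print();

            env.execute("restart-strategy-sketch");
        }
    }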

As for the last question: adding a TaskManager does not contribute to already-running jobs. The related feature is called dynamic scaling. You can rescale a job by taking a savepoint and then rerunning it with more resources. Have a look here. Automatic rescaling is work in progress.
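
As a sketch of that manual rescaling workflow with the Flink CLI (the job id, savepoint path and jar path are placeholders):

    # Trigger a savepoint for the running job (find the job id with `bin/flink list`);
    # the target directory argument is optional if a default savepoint directory is configured.
    bin/flink savepoint <jobID> [targetDirectory]

    # Cancel the job, then resubmit it from the savepoint with a higher parallelism
    # once the additional TaskManager (and its slots) is available.
    bin/flink cancel <jobID>
    bin/flink run -s <savepointPath> -p 2 path/to/your-job.jar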

Dawid Wysakowicz
  • Hi Dawid, thank you for the answers, it cleared things up a bit more for me. I made a new test with the parallelism set to 1 and each TaskManager set to 1 slot. You're right, the job is retried on the remaining TaskManager, but I get an ERROR: Caused by: java.io.FileNotFoundException: /home/ubuntu/Prototype/flink/flink-checkpoints/6fc6168a1e5a6a27f58f6d57deeacb65/chk-37/31c325f7-2b57-4e6b-bc20-3f6e9390a724 (No such file or directory). It seems the checkpoints are not available on the second TaskManager, and this causes the job to fail. Do you know if checkpoints are synchronized between TaskManagers? – razvan Mar 08 '17 at 10:51
  • Where the checkpoints are stored depends on the StateBackend used. For more info see: https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#state-backends – Dawid Wysakowicz Mar 08 '17 at 10:57
  • Sure, that is clear; I use the filesystem state backend. But I set up local paths on each TaskManager (they're in different VMs) and was expecting the framework to keep the state synchronized, in the JobManager for example. Apparently it's not: if a TaskManager goes down and it was saving checkpoints locally, the job will fail. Do you know if all TaskManagers' state backend locations should point to the same path? It generates folders with UUID names there; I want to make sure there won't be conflicts. – razvan Mar 08 '17 at 11:54
  • Each TaskManager can have a different path. They are independent. – Dawid Wysakowicz Mar 08 '17 at 12:02
  • Dawid, thank you for the answers. My last question is not completely answered, but I believe it belongs to a slightly different topic. If you can, please have a look: http://stackoverflow.com/questions/42672579/flink-state-backend-for-taskmanager Best Regards – razvan Mar 08 '17 at 13:29
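
Regarding the checkpoint-path discussion in the comments above: with the filesystem state backend, the checkpoint directory has to be readable by the JobManager and by every TaskManager, otherwise a task restarted on another machine cannot find the checkpoint files, which matches the FileNotFoundException seen here. A shared location (an NFS mount, HDFS, S3, ...) is therefore the usual choice. A minimal sketch, assuming the Flink 1.2 API; the URI is a placeholder:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SharedCheckpointPathSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 10 seconds.
            env.enableCheckpointing(10000);

            // The checkpoint directory must be visible to the JobManager and to all
            // TaskManagers; a node-local path breaks recovery when a task is restarted
            // on a different machine. The URI below is a placeholder (it could also be
            // an hdfs:// or s3:// URI).
            env.setStateBackend(new FsStateBackend("file:///mnt/shared/flink-checkpoints"));

            // ... define the sources/operators and call env.execute() here ...
        }
    }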