
I'm running Flink v1.10 with 1 JobManager and 3 TaskManagers in Docker Swarm, without ZooKeeper. I have a job running that takes 12 slots, and I have 3 TMs with 20 slots each (60 in total). After some tests everything went well except one.
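
For context, the deployment is roughly equivalent to a Swarm stack like the following (service names, image tag and port mapping here are illustrative, not my exact stack file):

    version: "3.7"
    services:
      jobmanager1:
        image: flink:1.10-scala_2.12
        command: jobmanager
        ports:
          - "8080:8080"      # rest.port from flink-conf.yaml
        # flink-conf.yaml below is mounted into /opt/flink/conf
      taskmanager:
        image: flink:1.10-scala_2.12
        command: taskmanager
        deploy:
          replicas: 3        # the 3 TMs, 20 slots each
        environment:
          - JOB_MANAGER_RPC_ADDRESS=jobmanager1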

The failing test is this: if I cancel the job manually, a side-car retries the job, but the TaskManager count shown in the web console doesn't recover and keeps decreasing.

A more practical example: I have a job running, consuming 12 of the 60 total slots.

  • The web console shows 48 slots free and 3 TMs.
  • I cancel the job manually, the side-car retriggers the job, and the web console shows 36 slots free and 2 TMs.
  • The job enters a failed state and the free slots keep decreasing until the console shows 0 slots free and 1 TM.
  • The only fix is to scale all 3 TMs down and back up, after which everything returns to normal.

Everything else works fine with this configuration: the JobManager recovers if I remove it, and scaling the TMs up or down works, but if I cancel the job the TMs seem to lose their connection to the JM.

Any suggestions on what I'm doing wrong?

Here is my flink-conf.yaml:



env.java.home: /usr/local/openjdk-8
env.log.dir: /opt/flink/
env.log.file: /var/log/flink.log
jobmanager.rpc.address: jobmanager1
jobmanager.rpc.port: 6123

jobmanager.heap.size: 2048m

#taskmanager.memory.process.size: 2048m

#env.java.opts.taskmanager: 2048m
taskmanager.memory.flink.size: 2048m

taskmanager.numberOfTaskSlots: 20

parallelism.default: 2


#==============================================================================
# High Availability
#==============================================================================

# The high-availability mode. Possible options are 'NONE' or 'zookeeper'.
#
high-availability: NONE

#high-availability.storageDir: file:///tmp/storageDir/flink_tmp/
#high-availability.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
#high-availability.zookeeper.quorum:


# ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes
# high-availability.zookeeper.client.acl: open

#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================

# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.backend.incremental: false

jobmanager.execution.failover-strategy: region

#==============================================================================
# Rest & web frontend
#==============================================================================

rest.port: 8080
rest.address: jobmanager1
# rest.bind-port: 8081
rest.bind-address: 0.0.0.0
#web.submit.enable: false

#==============================================================================
# Advanced
#==============================================================================

# io.tmp.dirs: /tmp
# classloader.resolve-order: child-first

# taskmanager.memory.network.fraction: 0.1
# taskmanager.memory.network.min: 64mb
# taskmanager.memory.network.max: 1gb

#==============================================================================
# Flink Cluster Security Configuration
#==============================================================================

# security.kerberos.login.use-ticket-cache: false
# security.kerberos.login.keytab: /mobi.me/flink/conf/smart3.keytab
# security.kerberos.login.principal: smart_user

# security.kerberos.login.contexts: Client,KafkaClient

#==============================================================================
# ZK Security Configuration
#==============================================================================

# zookeeper.sasl.login-context-name: Client

#==============================================================================
# HistoryServer
#==============================================================================

#jobmanager.archive.fs.dir: hdfs:///completed-jobs/
#historyserver.web.address: 0.0.0.0
#historyserver.web.port: 8082
#historyserver.archive.fs.dir: hdfs:///completed-jobs/
#historyserver.archive.fs.refresh-interval: 10000

blob.server.port: 6124
query.server.port: 6125
taskmanager.rpc.port: 6122
high-availability.jobmanager.port: 50010
zookeeper.sasl.disable: true
#recovery.mode: zookeeper
#recovery.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
#recovery.zookeeper.path.root: /
#recovery.zookeeper.path.namespace: /cluster_one



1 Answer

The solution was to increase the metaspace size in the flink-conf.yaml.
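
A minimal sketch of the change in flink-conf.yaml (the relevant option in Flink 1.10 is taskmanager.memory.jvm-metaspace.size; the 256m value is an assumption, tune it to your job):

    # Raise the JVM metaspace reserved for each TaskManager process.
    # Every (re)submission loads a new user-code classloader, so repeated
    # cancel/retry cycles can exhaust the default metaspace and take the TM down.
    taskmanager.memory.jvm-metaspace.size: 256m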

Br, André.