I cannot get a TaskManager to communicate with the JobManager on a Docker swarm stack.
The content of the stack.yml file I use with docker stack deploy is:
version: "3"
services:
  jobmanager:
    image: affo/flink:1.1.3
    ports:
      - "48081:8081"
    command: jobmanager
    networks:
      - my-net
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: none
      placement:
        constraints:
          - node.role == manager
  taskmanager:
    image: affo/flink:1.1.3
    depends_on:
      - jobmanager
    command: taskmanager
    networks:
      - my-net
    deploy:
      mode: replicated
      replicas: 4
      restart_policy:
        condition: none
      placement:
        constraints:
          - node.role != manager
networks:
  my-net:
    external: true
Docker image affo/flink:1.1.3 is a push to Docker Hub of the image built following the README at https://github.com/apache/flink/tree/release-1.1.3-rc2/flink-contrib/docker-flink.
Network my-net is an attachable overlay network.
I tried to ping every container from the others using DNS resolution, and everything works correctly.
However, no TaskManager can connect to the JobManager.
I report the JobManager log: http://pastebin.com/Ai5s4Xvr
And the log of one TaskManager: http://pastebin.com/ty5pZhSp
The JM has VIP 10.0.42.7, and jobmanager.rpc.address is set to jobmanager, which resolves to 10.0.42.7.
Any help or hint on where to start solving the problem would be appreciated.
Thanks a lot!
UPDATE
Here is the output of docker exec <jobmanager> netstat -tulpn:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.11:40762 0.0.0.0:* LISTEN -
tcp 0 0 ::ffff:10.0.42.7:6123 :::* LISTEN 218/java
tcp 0 0 :::8081 :::* LISTEN 218/java
tcp 0 0 :::34963 :::* LISTEN 218/java
udp 0 0 127.0.0.11:57000 0.0.0.0:* -
And of docker exec <a_taskmanager> telnet jobmanager 6123:
telnet: can't connect to remote host (10.0.42.7): Connection refused
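Since ping succeeds while telnet is refused, it can help to separate the possible TCP outcomes more explicitly. The following is a minimal sketch, not part of the original setup: the `probe` helper is a hypothetical name, and it assumes a Python interpreter is available wherever it is run. It reports whether a TCP connection attempt is accepted, refused, times out, or fails at name resolution.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify the result of a single TCP connection attempt to host:port."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on that address/port.
        return "refused"
    except socket.gaierror:
        # The host name could not be resolved.
        return "unresolved"
    except socket.timeout:
        # No answer at all within the timeout (for example, dropped packets).
        return "timeout"
    except OSError as exc:
        # Any other network-level failure, reported verbatim.
        return f"error: {exc}"
```

Calling `probe("jobmanager", 6123)` from a TaskManager container would correspond to the telnet test above; a result of "refused" matches the telnet message, whereas "timeout" or "unresolved" would point to a different kind of problem.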
Here is also a link to a possibly related issue on GitHub: https://github.com/docker/docker/issues/28795.
Thanks again
UPDATE
I recently managed to change jobmanager.rpc.address to 0.0.0.0 on the JobManager only, and now it is actually listening on all interfaces:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.11:56218 0.0.0.0:* LISTEN -
tcp 0 0 :::6123 :::* LISTEN 218/java
tcp 0 0 :::8081 :::* LISTEN 218/java
tcp 0 0 :::55231 :::* LISTEN 218/java
udp 0 0 127.0.0.11:47549 0.0.0.0:* -
I can even nc or telnet from TaskManagers.
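The relevant difference between the two netstat outputs is the local address for port 6123: `::ffff:10.0.42.7` in the first and the wildcard `::` in the second. As an illustrative sketch (the helper names `bound_hosts` and `is_wildcard` are made up for this example, and the parsing assumes the whitespace-separated netstat layout shown in this post), the comparison could be automated like this:

```python
from typing import List

# Local addresses that mean "listening on every interface".
WILDCARDS = {"0.0.0.0", "::", "*"}


def bound_hosts(netstat_output: str, port: int) -> List[str]:
    """Return the local addresses of TCP sockets listening on the given port."""
    hosts = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP rows, and TCP rows that are not in LISTEN state.
        if len(fields) < 4 or not fields[0].startswith("tcp"):
            continue
        if "LISTEN" not in fields:
            continue
        # The local address is the fourth column; the port follows the last colon.
        host, _, port_text = fields[3].rpartition(":")
        if port_text == str(port):
            hosts.append(host)
    return hosts


def is_wildcard(host: str) -> bool:
    """True when the address accepts connections on all interfaces."""
    return host in WILDCARDS
```

Feeding it the first output gives a single specific address for port 6123, and the second gives the wildcard, which is consistent with telnet being refused before the change and accepted after it.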
However, now the problem is (on the JobManager):
2017-02-09 10:31:20,794 ERROR akka.remote.EndpointWriter
- dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://flink@10.0.42.7:6123/]] arriving at [akka.tcp://flink@10.0.42.7:6123]
inbound addresses are [akka.tcp://flink@0.0.0.0:6123]
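The log line itself shows the mismatch: the message is addressed to flink@10.0.42.7:6123, but the actor system only lists flink@0.0.0.0:6123 as an inbound address. The sketch below only parses this kind of message and compares the two sides; the function name `parse_dropped_message` and the regular expressions are assumptions written for this example, not anything provided by Akka or Flink.

```python
import re
from typing import List, Optional, Tuple

# Matches addresses such as akka.tcp://flink@10.0.42.7:6123
ADDRESS = re.compile(r"akka\.tcp://[^@\s]+@(?P<host>[^:\s\]]+):(?P<port>\d+)")

Endpoint = Tuple[str, int]


def parse_dropped_message(log_text: str) -> Tuple[Optional[Endpoint], List[Endpoint]]:
    """Extract the 'arriving at' address and the list of inbound addresses."""
    arriving = None
    arriving_part = re.search(r"arriving at \[(.*?)\]", log_text)
    if arriving_part:
        match = ADDRESS.search(arriving_part.group(1))
        if match:
            arriving = (match.group("host"), int(match.group("port")))

    inbound = []
    inbound_part = re.search(r"inbound addresses are \[(.*?)\]", log_text)
    if inbound_part:
        for match in ADDRESS.finditer(inbound_part.group(1)):
            inbound.append((match.group("host"), int(match.group("port"))))
    return arriving, inbound


def address_accepted(log_text: str) -> bool:
    """True when the arriving address is one of the inbound addresses."""
    arriving, inbound = parse_dropped_message(log_text)
    return arriving is not None and arriving in inbound
```

For the error above this yields ("10.0.42.7", 6123) against [("0.0.0.0", 6123)], so the check fails. That suggests binding to 0.0.0.0 makes the port reachable but leaves the address TaskManagers use different from the one the JobManager accepts.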
Any help would be appreciated, thank you!
UPDATE
I think I have isolated the problem. I opened an issue on GitHub: https://github.com/docker/docker/issues/30874