I cannot get a TaskManager to communicate with the JobManager on a Docker swarm stack.

This is the stack.yml file I deploy with docker stack deploy:

version: "3"
services:
  jobmanager:
    image: affo/flink:1.1.3
    ports:
      - "48081:8081"
    command: jobmanager
    networks:
      - my-net
    deploy:
        mode: replicated
        replicas: 1
        restart_policy:
            condition: none
        placement:
            constraints:
                - node.role == manager

  taskmanager:
    image: affo/flink:1.1.3
    depends_on:
      - jobmanager
    command: taskmanager
    networks:
      - my-net
    deploy:
        mode: replicated
        replicas: 4
        restart_policy:
            condition: none
        placement:
            constraints:
                - node.role != manager

networks:
    my-net:
        external: true

The Docker image affo/flink:1.1.3 is published on Docker Hub. It was built by following the README at https://github.com/apache/flink/tree/release-1.1.3-rc2/flink-contrib/docker-flink.

The network my-net is an attachable overlay network.

I tried pinging every container from the others using DNS resolution, and everything works correctly.

However, no TaskManager can connect to the JobManager.

Here is the JobManager log: http://pastebin.com/Ai5s4Xvr

And the log of one TaskManager: http://pastebin.com/ty5pZhSp

The JobManager has VIP 10.0.42.7, and jobmanager.rpc.address is set to jobmanager, which resolves to 10.0.42.7.

Any help or hint on where to start solving the problem would be appreciated.

Thanks a lot!

UPDATE

Here is the output of docker exec <jobmanager> netstat -tulpn:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.11:40762        0.0.0.0:*               LISTEN      -
tcp        0      0 ::ffff:10.0.42.7:6123   :::*                    LISTEN      218/java
tcp        0      0 :::8081                 :::*                    LISTEN      218/java
tcp        0      0 :::34963                :::*                    LISTEN      218/java
udp        0      0 127.0.0.11:57000        0.0.0.0:*                           -

And the output of docker exec <a_taskmanager> telnet jobmanager 6123:

telnet: can't connect to remote host (10.0.42.7): Connection refused

Here is also a link to a possibly related issue on GitHub: https://github.com/docker/docker/issues/28795.

Thanks again

UPDATE

I recently managed to change jobmanager.rpc.address to 0.0.0.0 on the JobManager only, and now it is actually listening on all interfaces:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.11:56218        0.0.0.0:*               LISTEN      -
tcp        0      0 :::6123                 :::*                    LISTEN      218/java
tcp        0      0 :::8081                 :::*                    LISTEN      218/java
tcp        0      0 :::55231                :::*                    LISTEN      218/java
udp        0      0 127.0.0.11:47549        0.0.0.0:*                           -

I can even connect with nc or telnet from the TaskManagers.

However, the JobManager now logs this error:

2017-02-09 10:31:20,794 ERROR akka.remote.EndpointWriter                
- dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://flink@10.0.42.7:6123/]] arriving at [akka.tcp://flink@10.0.42.7:6123]
inbound addresses are [akka.tcp://flink@0.0.0.0:6123]

Any help would be appreciated, thank you!

UPDATE

I think I have isolated the problem and opened an issue on GitHub: https://github.com/docker/docker/issues/30874

affo

1 Answer

As the issue opened on GitHub explains, the real problem was in the VIP assignment of swarm's native networking. I turned the VIP off and everything works now.

So far there is no way to turn it off from the compose file, so I had to switch to a scripted deploy instead of an automatic docker stack deploy.
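
For readers who want a starting point, here is a minimal sketch of what such a scripted deploy could look like. It is not the exact script I used. It assumes that "turning the VIP off" means creating the services with --endpoint-mode dnsrr, that the overlay network my-net already exists, and that the web UI port is published in host mode (ingress-mode published ports are, as far as I know, not accepted together with dnsrr). Applying dnsrr to the taskmanager service as well is also an assumption. The run helper and the DRY_RUN variable are hypothetical names added for illustration: by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of a scripted deploy replacing `docker stack deploy`.
# Assumption: the attachable overlay network my-net already exists.
# DRY_RUN defaults to 1, so commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
IMAGE="affo/flink:1.1.3"
NETWORK="my-net"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

deploy_flink() {
  # JobManager: dnsrr makes the service name resolve to the container IP
  # instead of a virtual IP. Host-mode publishing is an assumption made
  # because ingress-mode ports are not expected to work with dnsrr.
  run docker service create \
    --name jobmanager \
    --network "$NETWORK" \
    --endpoint-mode dnsrr \
    --publish mode=host,target=8081,published=48081 \
    --replicas 1 \
    --restart-condition none \
    --constraint 'node.role == manager' \
    "$IMAGE" jobmanager

  # TaskManagers: same settings as in stack.yml, no published ports.
  run docker service create \
    --name taskmanager \
    --network "$NETWORK" \
    --endpoint-mode dnsrr \
    --replicas 4 \
    --restart-condition none \
    --constraint 'node.role != manager' \
    "$IMAGE" taskmanager
}

deploy_flink
```

Run it as is to review the two docker service create commands, then run it again with DRY_RUN=0 on a manager node to actually create the services.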

affo