I am trying to move from Flink 1.3.2 to 1.5. We have a cluster deployed on Kubernetes. Everything works fine with 1.3.2, but I cannot submit a job with 1.5. When I try, the spinner just spins indefinitely, and the same happens via the REST API. I can't even submit the WordCount example job. It seems my taskmanagers cannot connect to the jobmanager: I can see them in the Flink UI, but in the logs I see:

level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123

level=WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]

level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123

However, I can telnet from the taskmanager to the jobmanager.

Moreover, everything works locally if I start Flink in cluster mode (jobmanager + taskmanager). In the 1.5 documentation I found the `mode` option, which switches between flip6 and legacy (default flip6), but if I set mode: legacy I don't see my taskmanagers registered at all.

Is there something specific to a k8s deployment with 1.5 that I need to do? I checked the 1.5 k8s config and it looks much the same as ours, but we are using a customized Docker image for Flink (security, HA, checkpointing).

Thank you.

Georgy Gobozov
  • I think you should check your dependency's consistency, one more time! – Soheil Pourbafrani Jun 11 '18 at 21:25
  • Jobs rebuilt with flink 1.5.0 dependencies mentioned here https://flink.apache.org/downloads.html That's what we put in lib folder aws-java-sdk-1.7.4.jar, flink-dist_2.11-1.5.0.jar, flink-metrics-datadog-1.5.0.jar, flink-python_2.11-1.5.0.jar, flink-shaded-hadoop2-uber-1.5.0.jar, hadoop-aws-2.7.2.jar , httpclient-4.5.3.jar, httpcore-4.4.4.jar, jackson-annotations-2.6.7.jar, jackson-core-2.6.7.jar, jackson-databind-2.6.7.jar, joda-time-2.8.1.jar, log4j-1.2.17.jar, slf4j-log4j12-1.7.7.jar – Georgy Gobozov Jun 11 '18 at 21:33
  • Could you share the full client and cluster entrypoint logs with us @GeorgyGobozov? It would also be helpful to see your K8s deployment and service definition. In order to submit a job with the client you need to expose the rest endpoint port (8081) and the blob server port as a `NodePort`. If you only want to use the web UI it should be enough to expose these ports as `ClusterIP` – Till Rohrmann Jun 12 '18 at 07:12
  • @TillRohrmann I am trying to submit the job from the web UI only, at least for now. Here are my k8s configs: https://pastebin.com/4W4KmvfR https://pastebin.com/1Rvd87Cc https://pastebin.com/Jd8mRXAH I switched to the flink:latest images, but I am still getting the issue with job submission. When I try to submit WordCount, I get this on the jobmanager: "Could not connect to BlobServer at address flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.250.57:6124" "Caused by: java.net.ConnectException: Connection timed out" – Georgy Gobozov Jun 12 '18 at 21:09
  • Could you check whether `flink-jobmanager-nonprod-2` is reachable from the node on which the `JobManager` is deployed. There are some known problems of this kind with K8s: https://github.com/kubernetes/kubernetes/issues/20475, https://github.com/kubernetes/kubernetes/issues/19930 and https://github.com/kubernetes/kubernetes/issues/20391 – Till Rohrmann Jun 12 '18 at 21:33
  • @TillRohrmann I do not own the k8s cluster and cannot connect to the node directly, but telnet to the service works from all pods except the jobmanager. It is the same with the 1.3.2 cluster, yet 1.3.2 has no issue. Was it implemented differently in 1.3.2? I talked to our k8s team and they said this is a hairpin issue, whatever that means. It seems a call from a pod to itself via the service just hangs. – Georgy Gobozov Jun 13 '18 at 00:05
  • I think the difference between `1.3.2` and `1.5.0` is that the former connects to `localhost` when trying to upload the user code jars from the `JarRunHandler` and `1.5.0` will connect against the K8s service. I think it would be good to resolve the K8s problem because it seems to be the root cause of the problem. – Till Rohrmann Jun 13 '18 at 06:50
  • @TillRohrmann I added the service hostname as 127.0.0.1 to the jobmanager hosts file, and it works now. Thank you! – Georgy Gobozov Jun 15 '18 at 21:09
  • Good to hear @GeorgyGobozov :-) – Till Rohrmann Jun 16 '18 at 06:26
  • @TillRohrmann, how did you add the service hostname to the jobmanager? Did you create a new Flink Docker image? – Aleksandr Filichkin Jul 24 '18 at 10:54
  • I see the same issue with 1.5.1 in Minikube, but it works fine with 1.4.2 – Aleksandr Filichkin Jul 24 '18 at 11:08
  • @AleksandrFilichkin you need to create a new image (ideally derived from the existing image) which sets the job manager rpc address to the service name. – Till Rohrmann Jul 24 '18 at 12:22
  • @TillRohrmann Do you mean k8s job-manager service name? – Aleksandr Filichkin Jul 24 '18 at 12:33
  • Yes, the name of the service which you start to make the job manager reachable from other pods. – Till Rohrmann Jul 24 '18 at 13:26
  • I see a similar issue, but for me it looks like the TM cannot be found. I'm assuming it is related, but I do not know how to add the TM name to /etc/hosts, since the name is the pod name, which is ephemeral. – victtim Jan 29 '19 at 01:07

1 Answer

The issue is with jobmanager connectivity. The jobmanager Docker container cannot connect to the "flink-jobmanager" (${JOB_MANAGER_RPC_ADDRESS}) address.

Just use the afilichkin/flink-k8s Docker image instead of flink:latest.

I fixed it by adding a new host entry to the jobmanager Docker image. You can see it in my GitHub project:

https://github.com/Aleksandr-Filichkin/flink-k8s/tree/master
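If building a custom image is not an option, the same hosts-file workaround described in the comments (mapping the service hostname to 127.0.0.1 inside the jobmanager pod) can probably be expressed with the Kubernetes `hostAliases` field instead. The fragment below is an untested sketch, not the contents of the linked project: the Deployment name, labels, and container layout are assumptions, and the hostname must match whatever `JOB_MANAGER_RPC_ADDRESS` is set to in your setup. It also assumes the root cause is the hairpin problem discussed above, where a pod cannot reach itself through its own service.

```yaml
# Sketch only: JobManager Deployment with a hosts-file entry added by Kubernetes.
# All names and labels here are hypothetical; adapt them to your manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      # Adds "127.0.0.1 flink-jobmanager" to /etc/hosts in this pod only,
      # so the jobmanager reaches itself without going through the service.
      hostAliases:
        - ip: "127.0.0.1"
          hostnames:
            - "flink-jobmanager"   # assumed to equal JOB_MANAGER_RPC_ADDRESS
      containers:
        - name: jobmanager
          image: flink:latest
          args: ["jobmanager"]
          env:
            - name: JOB_MANAGER_RPC_ADDRESS
              value: flink-jobmanager
          ports:
            - containerPort: 6123   # RPC
            - containerPort: 6124   # blob server
            - containerPort: 8081   # web UI / REST
```

Only the jobmanager pod gets the alias; the taskmanager pods should keep resolving the name through the service as before. If the service is addressed by its fully qualified name (for example `<service>.<namespace>.svc.cluster.local`), that name would need to be listed under `hostnames` as well.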

Aleksandr Filichkin