
When I run a TFX pipeline / Apache Beam job on a Flink runner, it works fine with 1 task manager (on one node) and parallelism 2 (2 task slots per task manager). But it hangs when I try higher parallelism across more than one task manager, with the following message constantly repeating on both task managers:

INFO org.apache.beam.runners.fnexecution.environment.ExternalEnvironmentFactory [] - Still waiting for startup of environment from a65a0c5f8f962428897aac40763e57b0-1334930809.eu-central-1.elb.amazonaws.com:50000 for worker id 1-1

The Flink cluster runs as a native Kubernetes deployment on an AWS EKS cluster.

I use the following parameters:

        "--runner=FlinkRunner",
        "--parallelism=4",
        f"--flink_master={flink_url}:8081",
        "--environment_type=EXTERNAL",
        f"--environment_config={beam_sdk_url}:50000",
        "--flink_submit_uber_jar",
        "--worker_harness_container_image=none",
    

EDIT: Adding additional info about the configuration.

I have configured the Beam workers to run as side-cars (at least this is my understanding of how it should work) by setting the Flink parameter:

kubernetes.pod-template-file.taskmanager

which points to a template file with the following contents:

kind: Pod
metadata:
  name: taskmanager-pod-template
spec:
  #hostNetwork: true
  containers:
    # Flink task manager container
    - name: flink-main-container
      #image: apache/flink:scala_2.12
      env:
        - name: AWS_REGION
          value: "eu-central-1"
        - name: S3_VERIFY_SSL
          value: "0"
        - name: PYTHONPATH
          value: "/data/flink/src"
      args: ["taskmanager"]
      ports:
        - containerPort: 6122 #22
          name: rpc
        - containerPort: 6125
          name: query-state
      livenessProbe:
        tcpSocket:
          port: 6122 #22
        initialDelaySeconds: 30
        periodSeconds: 60
    # Beam SDK worker pool side-car, listening on port 50000
    - name: beam-worker-pool
      env:
        - name: PYTHONPATH
          value: "/data/flink/src"
        - name: AWS_REGION
          value: "eu-central-1"
        - name: S3_VERIFY_SSL
          value: "0"
      image: 848221505146.dkr.ecr.eu-central-1.amazonaws.com/flink-workers
      imagePullPolicy: Always
      args: ["--worker_pool"]
      ports:
        - containerPort: 50000
          name: pool
      livenessProbe:
        tcpSocket:
          port: 50000
        initialDelaySeconds: 30
        periodSeconds: 60
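
For completeness, the template is wired in through the Flink configuration roughly like this (the path is just a placeholder for wherever the template file lives on the machine that creates the cluster):

    # flink-conf.yaml (sketch; the path is a placeholder)
    kubernetes.pod-template-file.taskmanager: /path/to/taskmanager-pod-template.yaml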

I have also created a Kubernetes load balancer for the task managers so that clients can connect on port 50000, and I use that load balancer address when configuring:

f"--environment_config={beam_sdk_url}:50000",

EDIT 2: Looks like the Beam SDK harness on one task manager wants to connect to the endpoint running on the other task manager, but looks for it on localhost:

Log from beam-worker-pool on TM 2:

2021/08/11 09:43:16 Failed to obtain provisioning information: failed to dial server at localhost:33705
    caused by:
context deadline exceeded

The provision endpoint actually listening on port 33705 is the one on TM 1, while this worker looks for it on its own localhost, so it cannot connect.

EDIT 3: Showing how I test this:

Logs:
...............

TM 1:
========
$ kubectl logs my-first-flink-cluster-taskmanager-1-1 -c beam-worker-pool
2021/08/12 09:10:34 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:33383', '--artifact_endpoint=localhost:43477', '--provision_endpoint=localhost:40983', '--control_endpoint=localhost:34793']
2021/08/12 09:13:05 Failed to obtain provisioning information: failed to dial server at localhost:40983
    caused by:
context deadline exceeded

TM 2:
=========
$ kubectl logs my-first-flink-cluster-taskmanager-1-2 -c beam-worker-pool
2021/08/12 09:10:33 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:40497', '--artifact_endpoint=localhost:36245', '--provision_endpoint=localhost:32907', '--control_endpoint=localhost:46083']
2021/08/12 09:13:09 Failed to obtain provisioning information: failed to dial server at localhost:32907
    caused by:
context deadline exceeded

Testing:
.........................

TM 1:
============
$ kubectl exec -it my-first-flink-cluster-taskmanager-1-1 -c beam-worker-pool -- bash
root@my-first-flink-cluster-taskmanager-1-1:/# curl localhost:40983
curl: (7) Failed to connect to localhost port 40983: Connection refused

root@my-first-flink-cluster-taskmanager-1-1:/# curl localhost:32907
Warning: Binary output can mess up your terminal. Use "--output -" to ...


TM 2:
=============
root@my-first-flink-cluster-taskmanager-1-2:/# curl localhost:32907
curl: (7) Failed to connect to localhost port 32907: Connection refused

root@my-first-flink-cluster-taskmanager-1-2:/# curl localhost:40983
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output

Not sure how to fix this.

Thanks, Gorjan

  • Just to clarify, when you run with 1 task manager you're still connecting to the same endpoint for your external environment, right? It sounds like there's some kind of conflict caused by both task managers trying to create an external environment at that endpoint ([source code](https://github.com/apache/beam/blob/v2.30.0/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ExternalEnvironmentFactory.java#L91)) but I'm not sure if that's due to a known limitation, a bug, or some sort of user error. – Daniel Oliveira Aug 11 '21 at 00:28
  • Yes, @DanielOliveira - running with one task manager I use the same endpoint. I have edited my question with a bit more detail about the configuration. – Gorjan Todorovski Aug 11 '21 at 06:43

2 Answers


It's not recommended to try to connect to the same environment with different task managers. Usually we recommend setting up the Beam workers as side cars to the task managers so there's a 1:1 correspondence, then connecting via localhost. See the example config at https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_flink_cluster.yaml and https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_wordcount_py.yaml

ibzib
  • I have already configured the Beam worker containers to run as side-cars to the task manager containers (in the same pod) by setting the Flink parameter `kubernetes.pod-template-file.taskmanager`. I have now edited my question with more details around this. Please check the contents of that template file in the question. – Gorjan Todorovski Aug 11 '21 at 06:38
  • It seems the Beam worker on one task manager expects to connect to the Beam SDK harness endpoint on the other task manager, not the one it runs on, as I see this in the beam-worker-pool log from task manager 2: ```Starting worker with command ['/opt/apache/beam/boot', '--id=1-2', '--logging_endpoint=localhost:38669', '--artifact_endpoint=localhost:38211', '--provision_endpoint=localhost:33705', '--control_endpoint=localhost:33673'] 2021/08/11 09:43:16 Failed to obtain provisioning information: failed to dial server at localhost:33705 caused by: context deadline exceeded ``` – Gorjan Todorovski Aug 11 '21 at 10:53
  • These endpoints (for communication between the task manager and the Beam worker) are set up on localhost, and there's currently no way to configure them. – ibzib Aug 11 '21 at 20:29
  • Note that Beam worker IDs are unrelated to Flink task manager IDs, so the problem is not that task managers are trying to connect to the wrong worker. – ibzib Aug 11 '21 at 20:32
  • This is what is confusing me: – Gorjan Todorovski Aug 12 '21 at 09:26
  • Connected to TM1, I am only able to connect to an endpoint which I see in the logs as being started on TM2. Please check EDIT3 in my question, where I show this. – Gorjan Todorovski Aug 12 '21 at 09:35
  • And vice versa, connected to TM2, I am only able to connect to an endpoint which I see in the logs as being started on TM1. – Gorjan Todorovski Aug 12 '21 at 09:35
  • Can you try removing the load balancer? Beam/Flink should be able to do its own load balancing. – ibzib Aug 17 '21 at 22:35
  • Yes, I was able to fix this by setting the Beam SDK address to localhost instead of using a load balancer. – Gorjan Todorovski Aug 19 '21 at 11:43

I was able to fix this by setting the Beam SDK address to localhost instead of using a load balancer, so that the workers started on each task manager connect to the beam-worker-pool side-car running in the same pod. The config I use now is:

        "--runner=FlinkRunner",
        "--parallelism=4",
        f"--flink_master={flink_url}:8081",
        "--environment_type=EXTERNAL",
        "--environment_config=localhost:50000", # <--- Changed the address to localhost
        "--flink_submit_uber_jar",
        "--worker_harness_container_image=none",