
I am trying to launch a Dataflow flex template. As part of the build and deploy process, I am pre-building a custom SDK container image to reduce worker start-up time.

Here is what I have tried:

  1. When no sdk_container_image is specified and a requirements.txt file is provided, the Dataflow flex template launches successfully and a graph is built, but the workers cannot start because they are not authorized to install my private packages.
  • Errors:
    • Dataflow job appears to be stuck; no worker activity
    • Workers fail to install requirements
  2. When an sdk_container_image is given (with the dependencies pre-installed), the flex template job starts, but instead of running the pipeline inside that same job, it tries to launch a second Dataflow job for the pipeline using the same name, which fails.
  • Errors:
    • Duplicate Dataflow job name, DataflowJobAlreadyExistsError
  3. If I pass a second job_name for the pipeline, the pipeline starts successfully in a separate job, but the original flex template job eventually fails with a polling timeout. The pipeline itself then fails at the very last step because the SDK harness disconnected: "The worker VM had to shut down one or more processes due to lack of memory."
  • Template errors:
    • Timeout in polling result file
  • Pipeline errors:
    • SDK harness disconnected (Out of memory)
  4. When I launch the pipeline locally using DataflowRunner, the pipeline runs successfully under one job name (a simplified sketch of this local launch is below).
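
A minimal sketch of the local launch in (4), assuming the pre-built worker image is passed as sdk_container_image; the project, region, bucket, and image values are placeholders, not the real ones:

# launch_local.py -- minimal sketch of launching the pipeline locally with
# DataflowRunner. Project, region, bucket, and image values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/temp",   # placeholder
    sdk_container_image="gcr.io/my-project/beam-worker:latest",  # pre-built worker image
    experiments=["use_runner_v2"],
)

with beam.Pipeline(options=options) as p:
    # Real transforms go here; a trivial stand-in:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)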

Here are my Dockerfiles and gcloud commands:

Flex template Dockerfile:

FROM gcr.io/dataflow-templates-base/python39-template-launcher-base

# Create working directory
ARG WORKDIR=/flex
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
#   https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
RUN apt-get update && apt-get install -y libffi-dev && rm -rf /var/lib/apt/lists/*

COPY ./ ./

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/launch_pipeline.py"

# Install the pipeline dependencies
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir apache-beam[gcp]==2.41.0
RUN pip install --no-cache-dir -r requirements.txt

ENTRYPOINT [ "/opt/google/dataflow/python_template_launcher" ]
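
For context, launch_pipeline.py is the entry point the launcher invokes; it follows the usual flex template pattern of building PipelineOptions from the arguments the template forwards. A simplified sketch (the parameters and transforms are illustrative, not the real pipeline):

# launch_pipeline.py -- simplified sketch of the template entry point.
# The flex template launcher forwards the template parameters as command-line
# arguments; PipelineOptions is built from whatever argparse does not consume.
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)   # illustrative parameter
    parser.add_argument("--output", required=True)  # illustrative parameter
    known_args, pipeline_args = parser.parse_known_args(argv)

    options = PipelineOptions(pipeline_args, save_main_session=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText(known_args.input)
         | "Write" >> beam.io.WriteToText(known_args.output))


if __name__ == "__main__":
    run()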

Worker Dockerfile:

# Custom SDK container image for the Dataflow workers, with the pipeline
# dependencies pre-installed so the workers do not have to run pip at start-up.
FROM apache/beam_python3.9_sdk:2.41.0

WORKDIR /worker

COPY ./requirements.txt ./

# Pre-install the pipeline dependencies into the worker image.
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir -r requirements.txt

Building template:

gcloud dataflow flex-template build $TEMPLATE_LOCATION \
    --image "$IMAGE_LOCATION" \
    --sdk-language "PYTHON" \
    --metadata-file "metadata.json"

Launching template:

gcloud dataflow flex-template run ddjanke-local-flex \
    --template-file-gcs-location=$TEMPLATE_LOCATION \
    --project=$PROJECT \
    --service-account-email=$EMAIL \
    --parameters=[OTHER_ARGS...],sdk_container_image=$WORKER_IMAGE \
    --additional-experiments=use_runner_v2
ddjanke

1 Answer


I actually managed to solve this yesterday. The problem was that I was passing sdk_container_image to Dataflow as a flex template parameter and then also forwarding it into the PipelineOptions constructed in my code. After I removed sdk_container_image from the in-code options, the launcher ran the pipeline in the same job.
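
In sketch form (placeholder values, simplified from the actual code):

from apache_beam.options.pipeline_options import PipelineOptions

# Arguments forwarded by the flex template launcher (placeholders).
pipeline_args = ["--project=my-project", "--region=us-central1"]

# Before (sketch): sdk_container_image was also set inside the pipeline code,
# on top of the flex template parameter; with this, the launcher submitted the
# pipeline as a separate job that reused the same job name.
options_before = PipelineOptions(
    pipeline_args,
    sdk_container_image="gcr.io/my-project/beam-worker:latest",  # placeholder
)

# After (sketch): drop sdk_container_image from the in-code options and let the
# flex template parameter carry it; the pipeline then runs inside the same job.
options_after = PipelineOptions(pipeline_args)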

ddjanke