I am trying to launch a Dataflow flex template. As part of the build and deploy process, I am pre-building a custom SDK container image to reduce worker start-up time.
I have attempted this in the following ways:
- When no `sdk_container_image` is specified and a requirements.txt file is provided, the flex template launches successfully and a graph is built, but the workers cannot start because they lack the credentials to install private packages.
  - Errors:
    - Dataflow job appears to be stuck; no worker activity
    - Workers fail to install requirements
- When an `sdk_container_image` is given (with dependencies pre-installed), the Dataflow job starts, but instead of running the pipeline inside that same job, it attempts to launch a second Dataflow job for the pipeline under the same name, which fails (see the sketch after this list).
  - Errors:
    - Duplicate Dataflow job name: DataflowJobAlreadyExistsError
- If I pass a second `job_name` for the pipeline, the pipeline starts successfully in a separate job, but the original flex template job eventually fails with a polling timeout. The pipeline itself then fails at the very last step because the SDK harness disconnected: "The worker VM had to shut down one or more processes due to lack of memory."
  - Template errors:
    - Timeout in polling result file
  - Pipeline errors:
    - SDK harness disconnected (out of memory)
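For context, this is roughly how the job name flows through launch_pipeline.py in the second and third attempts (a trimmed sketch, not my full script; option values are placeholders):

```python
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    WorkerOptions,
)

options = PipelineOptions()  # populated from the template parameters
options.view_as(WorkerOptions).sdk_container_image = "WORKER_IMAGE"  # placeholder

# The flex template launcher is itself a Dataflow job named
# "ddjanke-local-flex". If the pipeline inherits that name, submission
# fails with DataflowJobAlreadyExistsError (second attempt above).
# Overriding it lets the pipeline start, but as a *separate* job, so the
# template job times out polling for a result (third attempt above).
options.view_as(GoogleCloudOptions).job_name = "ddjanke-local-flex-pipeline"
```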
- When I launch the pipeline locally using DataflowRunner, the pipeline runs successfully under one job name.
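For comparison, the local launch that works is essentially this (sketch with placeholder values):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# One job name, DataflowRunner, and the prebuilt worker image:
# launched locally, this runs the entire pipeline under a single Dataflow job.
options = PipelineOptions(
    runner="DataflowRunner",
    project="MY_PROJECT",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://MY_BUCKET/tmp",  # placeholder
    job_name="ddjanke-local-flex",
    sdk_container_image="WORKER_IMAGE",  # same prebuilt image
    experiments=["use_runner_v2"],
)

with beam.Pipeline(options=options) as p:
    ...  # same transforms as in launch_pipeline.py
```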
Here are my Dockerfiles and `gcloud` commands:
Flex template Dockerfile:
```dockerfile
FROM gcr.io/dataflow-templates-base/python39-template-launcher-base
# Create working directory
ARG WORKDIR=/flex
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
RUN apt-get update && apt-get install -y libffi-dev && rm -rf /var/lib/apt/lists/*
COPY ./ ./
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/launch_pipeline.py"
# Install the pipeline dependencies
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir apache-beam[gcp]==2.41.0
RUN pip install --no-cache-dir -r requirements.txt
ENTRYPOINT [ "/opt/google/dataflow/python_template_launcher" ]
```
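launch_pipeline.py (referenced by FLEX_TEMPLATE_PYTHON_PY_FILE above) boils down to this shape (trimmed sketch; the real transforms and I/O are more involved):

```python
# launch_pipeline.py (trimmed): the launcher container runs this file,
# which builds the pipeline and submits it to Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


def run(argv=None):
    options = PipelineOptions(argv)
    options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://MY_BUCKET/input.txt")  # placeholder I/O
            | "Process" >> beam.Map(str.strip)
            | "Write" >> beam.io.WriteToText("gs://MY_BUCKET/output")
        )


if __name__ == "__main__":
    run()
```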
Worker Dockerfile:
```dockerfile
# Set up image for worker.
FROM apache/beam_python3.9_sdk:2.41.0
WORKDIR /worker
COPY ./requirements.txt ./
RUN pip install --no-cache-dir --upgrade pip setuptools wheel
RUN pip install --no-cache-dir -r requirements.txt
```
Building template:
```sh
gcloud dataflow flex-template build $TEMPLATE_LOCATION \
--image "$IMAGE_LOCATION" \
--sdk-language "PYTHON" \
--metadata-file "metadata.json"
```
Launching template:
```sh
gcloud dataflow flex-template run ddjanke-local-flex \
--template-file-gcs-location=$TEMPLATE_LOCATION \
--project=$PROJECT \
--service-account-email=$EMAIL \
--parameters=[OTHER_ARGS...],sdk_container_image=$WORKER_IMAGE \
--additional-experiments=use_runner_v2
```