
After running a deployment script to launch a Dataflow Flex Template job, I get

"failed to read the job file : gs://dataflow-staging-europe-west2/------/staging/template_launches/{JOBNAME}/job_object with error message: (7ea9e263ad5cddb5): Unable to open template file: gs://dataflow-staging-europe-west2-644733586574/staging/template_launches/{JOBNAME}/job_object..

The console logs say "Template launch successful", and there is no Python error in the Cloud Build logs.

The pipeline reads CSV files from Cloud Storage, performs some transformations/computations on the raw data, and then creates Datastore entities. Here is the main structure of my Python code:

file-structure:

    ├── pipeline
    │   ├── runner.py
    │   ├── setup.py
    │   ├── ingestion
    │   │   ├── transformer.py
    │   │   ├── custom.py

[image: files in the editor]

Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
 
RUN apt-get update
# Upgrade pip and install the requirements.
RUN pip3 install --no-cache-dir --upgrade pip
RUN pip3 install apache-beam==2.35.0
RUN pip3 install google-cloud-logging
 
WORKDIR /
RUN mkdir -p /dataflow/template
WORKDIR /dataflow/template

COPY ingestion ${WORKDIR}/ingestion
COPY setup.py ${WORKDIR}/setup.py
COPY runner.py ${WORKDIR}/runner.py
 
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/runner.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
 
# Since we already downloaded all the dependencies, there's no need to rebuild everything.
ENV PIP_NO_DEPS=True
runner.py:

# std libs
import os

import logging
import datetime

# helper modules
from ingestion.all_settings import *
from ingestion.avg_helpers import *
from ingestion.transform import *

# Data-flow modules
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore


# MAIN function, to run dataflow pipeline module
def dataflow():
    JOB_NAME = f"datastore-upload-{datetime.datetime.now().strftime('%Y-%m-%d-%H%M%S')}"

    # wildcard expression for the storage bucket files containing the subject's data
    file_ex = ['gs://bucket-example-csv-file']

    #variable to store pipeline options to be passed into beam function later
    pipeline_options = {
        'runner': 'DirectRunner',
        'project': PROJECT,
        'region': 'europe-west-b',
        'job_name': JOB_NAME,
        'staging_location': TEST_BUCKET + '/staging',
        'temp_location': TEST_BUCKET + '/temp',
        'save_main_session': False,
        'streaming': False,
        'setup_file': '/dataflow/template/setup.py',
    }

    options = PipelineOptions.from_dictionary(pipeline_options)
    with beam.Pipeline(options=options) as p:
        for i,filename in enumerate(file_ex):
            (p 
            | 'Reading input files' >> beam.io.ReadFromText(filename, skip_header_lines = 1)
            | 'Converting from csv to dict' >> beam.ParDo(ProcessCSV(), harvard_medical_headers)
            | 'Create entities for minute averages' >> beam.ParDo(BuildMinuteEntities(),filename)
            | 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
            )
            p.run().wait_until_finish()


if __name__ == '__main__':
    dataflow()

2 Answers


You probably need to mention the setup file name in your Beam options:

...
    #variable to store pipeline options to be passed into beam function later
    pipeline_options = {
        'runner': 'DirectRunner',
        'project': PROJECT,
        'region': 'europe-west-b',
        'job_name': JOB_NAME,
        'staging_location': TEST_BUCKET + '/staging',
        'temp_location': TEST_BUCKET + '/temp',
        'save_main_session': False,
        'streaming': False,
        'setup_file': '/dataflow/template/setup.py',
    }
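
For reference, here is a minimal sketch of setting that option programmatically; the project and bucket names below are placeholders, not values from the question:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Dictionary form (note the colon, not '='):
options = PipelineOptions.from_dictionary({
    'project': 'my-project',                      # placeholder
    'temp_location': 'gs://my-bucket/temp',       # placeholder
    'setup_file': '/dataflow/template/setup.py',  # path inside the Flex Template container
})

# Equivalent typed form (sets the same underlying option):
options.view_as(SetupOptions).setup_file = '/dataflow/template/setup.py'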
eshirvana

In my case it was caused by the pipeline option 'runner': 'DirectRunner'. If you want to launch the job on the Dataflow service, you should set 'runner': 'DataflowRunner' instead.
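
For illustration, a sketch of the question's options dict with that change; PROJECT, TEST_BUCKET and JOB_NAME are the variables from the question's runner.py, and the region is written as europe-west2 (the region that appears in the staging path) since 'europe-west-b' is not a valid region name:

    pipeline_options = {
        'runner': 'DataflowRunner',   # run on the Dataflow service instead of locally
        'project': PROJECT,
        'region': 'europe-west2',
        'job_name': JOB_NAME,
        'staging_location': TEST_BUCKET + '/staging',
        'temp_location': TEST_BUCKET + '/temp',
        'save_main_session': False,
        'streaming': False,
        'setup_file': '/dataflow/template/setup.py',
    }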

ImustAdmit