
I am currently working on a Dataflow Template in Python, and I would like to access the Job ID and use it to save to a specific Firestore Document.

Is it possible to access the Job ID?

I cannot find anything regarding this in the documentation.

jmoore255

5 Answers


You can do so by calling dataflow.projects().locations().jobs().list from within the pipeline (see the full code below). One possibility is to always invoke the template with the same job name, which would make sense; otherwise, the job prefix could be passed as a runtime parameter. The list of jobs is filtered with a regex to check whether each job name contains the prefix and, if so, its job ID is returned. If more than one job matches, only the latest one is returned (which is the one currently running).

The template is staged, after defining the PROJECT and BUCKET variables, with:

python script.py \
    --runner DataflowRunner \
    --project $PROJECT \
    --staging_location gs://$BUCKET/staging \
    --temp_location gs://$BUCKET/temp \
    --template_location gs://$BUCKET/templates/retrieve_job_id

Then, specify the desired job name (myjobprefix in my case) when executing the templated job:

gcloud dataflow jobs run myjobprefix \
   --gcs-location gs://$BUCKET/templates/retrieve_job_id

The retrieve_job_id function will return the job ID from within the running job; change job_prefix to match the job name you used.

import argparse, logging, re
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def retrieve_job_id(element):
  project = 'PROJECT_ID'
  job_prefix = "myjobprefix"
  location = 'us-central1'

  logging.info("Looking for jobs with prefix {} in region {}...".format(job_prefix, location))

  try:
    # Build a Dataflow API client using Application Default Credentials
    credentials = GoogleCredentials.get_application_default()
    dataflow = build('dataflow', 'v1b3', credentials=credentials)

    # List the Dataflow jobs in the project and region
    result = dataflow.projects().locations().jobs().list(
      projectId=project,
      location=location,
    ).execute()

    job_id = "none"

    # Take the first job whose name contains the prefix
    # (the most recent one, i.e. the job currently running)
    for job in result['jobs']:
      if re.search(re.escape(job_prefix), job['name']):
        job_id = job['id']
        break

    logging.info("Job ID: {}".format(job_id))
    return job_id

  except Exception as e:
    logging.error("Error retrieving Job ID: {}".format(e))
    raise


def run(argv=None):
  parser = argparse.ArgumentParser()
  known_args, pipeline_args = parser.parse_known_args(argv)

  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True

  p = beam.Pipeline(options=pipeline_options)

  init_data = (p
               | 'Start' >> beam.Create(["Init pipeline"])
               | 'Retrieve Job ID' >> beam.Map(retrieve_job_id))  # Map emits the job ID as a single element

  p.run()


if __name__ == '__main__':
  run()
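
To tie this back to the question, here is a minimal sketch of a downstream step that writes the retrieved ID to Firestore; the google-cloud-firestore client and the "dataflow_jobs" collection name are assumptions for illustration, not part of the template above. It could be chained after the 'Retrieve Job ID' step with another beam.Map.

from google.cloud import firestore


def save_job_id_to_firestore(job_id):
  # Hypothetical: use the job ID as the document name in a "dataflow_jobs" collection
  db = firestore.Client()
  db.collection('dataflow_jobs').document(job_id).set({'job_id': job_id})
  return job_id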
Guillem Xercavins
  • Thanks a lot @Guillem Xercavins, that worked perfectly. – jmoore255 Sep 19 '18 at 15:08
  • This method, while it works on a small scale, will have its challenges if we are running two jobs with a similar prefix. For instance, we generally create the job name as PREFIX+TIMESTAMP, which lets us run two or more jobs in parallel. However, with this solution's technique we cannot differentiate between two jobs that have the same prefix. – Raj Oberoi Apr 23 '21 at 00:55

You can use the Google Dataflow API. Use the projects.jobs.list method to retrieve Dataflow Job IDs.
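
For reference, a minimal sketch of that call with the google-api-python-client library (the project ID below is a placeholder):

from googleapiclient.discovery import build

# Uses Application Default Credentials by default
dataflow = build('dataflow', 'v1b3')

# List the project's Dataflow jobs and print their IDs
response = dataflow.projects().jobs().list(projectId='my-project-id').execute()
for job in response.get('jobs', []):
    print(job['id'], job['name'], job['currentState'])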

Yurci

From skimming the documentation, the response you get from launching the job should contain a JSON body with a property "job" that is an instance of Job.

You should be able to use this to get the ID you need.

If you are using the Google Cloud SDK for Dataflow, you might get a different object when you call the create method on templates().
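
For instance, assuming response is the parsed JSON returned by a templates.launch call, the ID can be read like this:

# "job" holds a Job resource describing the newly created job
job = response.get('job', {})
job_id = job.get('id')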

Carlo Field
  • Thanks @Carlo Field, I can see the job_id when I launch the template, although I need the job_id to be retrieved automatically within the template, to use for naming a document in Firestore in which I save information from the pipeline. – jmoore255 Sep 18 '18 at 15:33

The following snippet launches a Dataflow template stored in a GCS bucket, gets the job ID from the response body of the launch template API, and then polls for the final state of the Dataflow job (every 10 seconds in this example).

The official Google Cloud documentation for the response body is here.

So far I have only seen six job states of a Dataflow job; please let me know if I have missed any others.

import logging
from time import sleep

import googleapiclient.discovery

logger = logging.getLogger(__name__)


def launch_dataflow_template(project_id, location, credentials, template_path):
    dataflow = googleapiclient.discovery.build('dataflow', 'v1b3', credentials=credentials)
    logger.info(f"Template path: {template_path}")
    result = dataflow.projects().locations().templates().launch(
            projectId=project_id,
            location=location,
            body={
                # launch parameters go here, e.g. jobName, parameters, environment
                ...
            },
            gcsPath=template_path  # dataflow template path
    ).execute()
    # The response body contains a Job resource under the "job" key
    return result.get('job', {}).get('id')


def poll_dataflow_job_status(project_id, location, credentials, job_id):
    dataflow = googleapiclient.discovery.build('dataflow', 'v1b3', credentials=credentials)
    # executing states are not final states of a Dataflow job; they show the job is transitioning to another state
    executing_states = ['JOB_STATE_PENDING', 'JOB_STATE_RUNNING', 'JOB_STATE_CANCELLING']
    # final states do not change further
    final_states = ['JOB_STATE_DONE', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED']
    while True:
        job_desc = _get_dataflow_job_status(dataflow, project_id, location, job_id)
        if job_desc['currentState'] in executing_states:
            pass
        elif job_desc['currentState'] in final_states:
            break
        sleep(10)
    return job_id, job_desc['currentState']
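
The `_get_dataflow_job_status` helper is not shown above; a minimal sketch of what it could look like, assuming it wraps the projects.locations.jobs.get method of the Dataflow API:

def _get_dataflow_job_status(dataflow, project_id, location, job_id):
    # Fetch the Job resource, which includes the currentState field
    return dataflow.projects().locations().jobs().get(
        projectId=project_id,
        location=location,
        jobId=job_id,
    ).execute()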
gamberooni
  • Thanks for the detailed code snippet. In what library can I find the method `_get_dataflow_job_status`? I could not find it anywhere. – Dinesh Aug 19 '22 at 14:30

You can get GCP metadata using these Beam functions in 2.35.0. See the documentation: https://beam.apache.org/releases/pydoc/2.35.0/_modules/apache_beam/io/gcp/gce_metadata_util.html#fetch_dataflow_job_id

import apache_beam as beam
import apache_beam.io.gcp.gce_metadata_util  # needed so the submodule is reachable via the beam alias

beam.io.gcp.gce_metadata_util._fetch_custom_gce_metadata("job_name")
beam.io.gcp.gce_metadata_util._fetch_custom_gce_metadata("job_id")
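
For example, a minimal sketch of calling the public wrapper from inside a DoFn; the values come from the GCE instance metadata server, so they only resolve to a real job ID when running on Dataflow workers, and the AddJobId class is just an illustration:

import apache_beam as beam
from apache_beam.io.gcp import gce_metadata_util


class AddJobId(beam.DoFn):
    def process(self, element):
        # fetch_dataflow_job_id() wraps _fetch_custom_gce_metadata("job_id")
        job_id = gce_metadata_util.fetch_dataflow_job_id()
        yield (job_id, element)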