
I have submitted a Hive job to a Dataproc cluster using Airflow's DataprocWorkflowTemplateInstantiateInlineOperator. When some of the jobs fail, I can see a link to the failure log under Google Cloud > Dataproc > Jobs:

Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found in 'gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput'

Can I fetch this log link (e.g. gs://dataproc-abcde12-efghi23-jklmn12-uk/google-cloud-dataproc-metainfo/12354a681fgh161/jobs/job1-abdc12jssa/driveroutput) through Airflow?

I checked the gcp_dataproc_hook.py operator for anything that points to a log link so that I can retrieve it, but didn't find anything useful.

Igor Dvorzhak
saicharan
    It appears this is already being logged: https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/gcp_dataproc_hook.py#L53 – tix Feb 12 '19 at 22:22

1 Answer


Looks like there's no auto-created handy link to fetch the output in Airflow's logs yet, but it could certainly be added (if you're feeling bold, it could be worth sending a pull request to Airflow yourself! Otherwise, you could file a feature request at https://issues.apache.org/jira/browse/AIRFLOW).

In general, you can construct a handy URL or a copy/pasteable CLI command given the job ID; if you want to use Dataproc's UI directly, simply construct a URL of the form:

https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s

with the params jobId, projectId, and region.
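
For example, a minimal Python sketch that builds such a URL (the project and region values below are just placeholders):

DATAPROC_JOB_URL = 'https://cloud.google.com/console/dataproc/jobs/%s/?project=%s&region=%s'

def dataproc_job_url(job_id, project_id, region):
    # Returns a clickable Cloud Console URL for the given Dataproc job.
    return DATAPROC_JOB_URL % (job_id, project_id, region)

# e.g. print/log it from an Airflow task so the link shows up in the task log
print(dataproc_job_url('job1-abdc12jssa', 'my-project', 'global'))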

Alternatively, you could type:

gcloud dataproc jobs wait ${JOBID} --project ${PROJECTID} --region ${REGION}
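
For instance, here's a rough sketch (the DAG object, task_id, and parameter values are assumptions) of wiring that command into an Airflow BashOperator so the driver output ends up in the task log:

from airflow.operators.bash_operator import BashOperator

# Sketch: stream the Dataproc driver output into the Airflow task log.
# The job_id/project_id/region values are placeholders.
tail_driver_output = BashOperator(
    task_id='tail_dataproc_driver_output',
    bash_command=(
        'gcloud dataproc jobs wait {{ params.job_id }} '
        '--project {{ params.project_id }} --region {{ params.region }}'
    ),
    params={'job_id': 'job1-abdc12jssa',
            'project_id': 'my-project',
            'region': 'global'},
    dag=dag,  # assumes an existing DAG object
)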

A more direct approach with the URI would be:

gsutil cat ${LOG_LINK}*

with a glob expression at the end of that URI (the driver output isn't a single file, it's a set of files).
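
If you'd rather read those files from Python (say, inside an Airflow task), a small sketch using the google-cloud-storage client could look like the following; the way the bucket name and prefix are split out of the gs:// URI is an assumption based on the link format above:

from google.cloud import storage

# Sketch: read the sharded driveroutput files behind the log link.
log_link = ('gs://dataproc-abcde12-efghi23-jklmn12-uk/'
            'google-cloud-dataproc-metainfo/12354a681fgh161/'
            'jobs/job1-abdc12jssa/driveroutput')
bucket_name, _, prefix = log_link[len('gs://'):].partition('/')

client = storage.Client()
for blob in client.list_blobs(bucket_name, prefix=prefix):
    print(blob.download_as_string().decode('utf-8'))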

Dennis Huo