We have Spark jobs running on a Dataproc cluster with YARN. We also have a wrapper program in Python that constantly polls the job's status, and we monitor the job state from YARN as follows:

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
dataproc = discovery.build('dataproc', 'v1', credentials=credentials)

job_id = '8873a82c-6201-48d4-8ad3-d8f236ef9c49'
project_id = 'dev-111111'
region = 'global'

result = dataproc.projects().regions().jobs().get(
    projectId=project_id, region=region, jobId=job_id).execute()

print(result['yarnApplications'][0]['state'])

As suggested by the Google Dataproc documentation here.

The "result" above is a JSON object, and within it there is a field called "yarnApplications": a list whose first (and, in our case, only) element contains the job state we are interested in.
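To make the polling robust, it may be safer not to index `[0]` blindly but to treat "yarnApplications" as the list it is declared to be. A minimal sketch, using an illustrative response shaped like the v1 Job resource (the field values here are made up):

```python
# Hypothetical Job resource as returned by jobs().get().execute();
# field names follow the v1 Job resource, the values are illustrative.
result = {
    'status': {'state': 'RUNNING'},
    'yarnApplications': [
        {'name': 'my-spark-job', 'state': 'RUNNING', 'progress': 0.42},
    ],
}

# Guard against a missing or empty list instead of indexing [0] directly.
apps = result.get('yarnApplications', [])
states = [app['state'] for app in apps]
print(states)  # ['RUNNING']
```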

The question is: why is this "yarnApplications" field always a list, even when we only have one YARN job running? We've seen situations where YARN makes multiple attempts to launch a job; will the "yarnApplications" field contain multiple elements in that situation?

Also, is it guaranteed that, if we only have one job running on YARN, the "yarnApplications" list will contain only one element?

We understand that the Dataproc client is only a beta version, but we have a production system running on it, so we would appreciate any input and suggestions.

Thanks

Howard Xie

1 Answer

Per the Dataproc API job definition, jobs contain a "collection" of YarnApplications, and in general this definition can't change its type based on the runtime contents. For example, the Java interface for Job.getYarnApplications() returns a java.util.List, regardless of whether the list happens to have zero, one, or many elements.

This API definition was designed to accommodate various job types which may submit multiple YARN applications per job, such as Hive or Pig. In some cases Hadoop jarfiles also submit multiple jobs, for example if the driver program is an Apache Crunch program, or you run Gridmix.

You are indeed guaranteed that if you only have one job running in YARN, then the list object will only contain one element; the YARN applications in the list are only the ones created by the given Job invocation. Even if you run multiple Dataproc jobs at the same time with each one submitting different concurrent YARN applications, each job will only contain the particular YARN application(s) the job submitted itself.
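Since a single job may legitimately list several YARN applications, a poller should treat the job as finished only when every application it spawned has reached a terminal state. A minimal sketch, assuming a Job dict shaped like the v1 Job resource and the standard YARN terminal states:

```python
# Terminal YARN application states (per the YarnApplication state enum).
TERMINAL_STATES = {'FINISHED', 'FAILED', 'KILLED'}

def all_apps_done(job):
    """True once every YARN application spawned by this Dataproc job
    has terminated; `job` is a dict shaped like the v1 Job resource."""
    apps = job.get('yarnApplications', [])
    return bool(apps) and all(app['state'] in TERMINAL_STATES for app in apps)

# A Hive- or Pig-style job may report several applications at once:
job = {'yarnApplications': [{'state': 'FINISHED'}, {'state': 'RUNNING'}]}
print(all_apps_done(job))  # False
```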

Dennis Huo
  • Thank you very much Dennis, appreciate your comment. One question though - you mentioned that "You are indeed guaranteed that if you only have one job running in YARN..." - is this from documentation somewhere or your observations? If it's from documentation, do you mind sharing it please? Thanks again! – Howard Xie Jan 23 '17 at 15:08
  • In the situation you described where "yarn is doing multiple attempts", in YARN that remains under a single YARN application ID, and the multiple attempts are listed inside that; "yarn applications -list -appStates ALL" would still only return the one application ID for all attempts. So Dataproc isn't promising anything special about only surfacing one YARN application when only one application is running; rather, Dataproc tracks YARN applications using jobid tags. If your driver program itself retries issuing YARN applications, then Dataproc would indeed show multiple apps in a job. – Dennis Huo Jan 24 '17 at 02:28
  • The documentation for now is just inside the [Job resource definition](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#resource-job), but doesn't go into depth about how retries might affect the number of YARN applications listed inside the job; that behavior may differ between different engines. In the end, Dataproc's listing of YARN applications will contain similar contents as running `yarn application -list` in the cluster, but filtered by the jobid that spawned the application(s). – Dennis Huo Jan 24 '17 at 02:30