We have Spark jobs running on a Dataproc cluster with YARN. We also have a wrapper program in Python that constantly polls each job's status, monitoring the job state reported by YARN, as follows:
from googleapiclient import discovery

dataproc = discovery.build('dataproc', 'v1', credentials=credentials)
job_id = '8873a82c-6201-48d4-8ad3-d8f236ef9c49'
project_id = 'dev-111111'
region = 'global'
result = dataproc.projects().regions().jobs().get(
    projectId=project_id, region=region, jobId=job_id).execute()
print(result['yarnApplications'][0]['state'])
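For context, our wrapper essentially runs the call above in a loop. Here is a minimal sketch of that loop; fetch_job() below is a hypothetical stub standing in for the dataproc.projects().regions().jobs().get(...).execute() call, so the example is self-contained:

```python
import time

def fetch_job(_poll_count=[0]):
    # Hypothetical stub for the Dataproc jobs().get().execute() call:
    # reports RUNNING twice, then DONE, mimicking a job that finishes.
    _poll_count[0] += 1
    state = 'DONE' if _poll_count[0] >= 3 else 'RUNNING'
    return {'yarnApplications': [{'state': state}]}

def poll_until_finished(poll_interval=0.01):
    # Poll the job until YARN reports a terminal state.
    while True:
        result = fetch_job()
        state = result['yarnApplications'][0]['state']
        if state in ('DONE', 'FAILED', 'KILLED'):
            return state
        time.sleep(poll_interval)

print(poll_until_finished())
```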
as suggested by Google's Dataproc documentation here.
The "result" above is a JSON object containing a field called "yarnApplications", a list whose first (and so far only) element holds the job state we are interested in.
The question is: why is this "yarnApplications" field always a list, even when we only have one YARN job running? We've seen situations where YARN makes multiple attempts to launch a job; will "yarnApplications" contain multiple elements in that case?
Also, is it guaranteed that, if we only have one job running on YARN, the "yarnApplications" list will contain exactly one element?
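Until we know the answer, we've been guarding against multiple entries rather than assuming apps[0] is the only one. A sketch of that defensive handling, using a made-up sample response with two hypothetical YARN attempts:

```python
# Hypothetical sample of the jobs().get() response, with two entries
# as might happen if YARN made multiple launch attempts.
result = {
    'yarnApplications': [
        {'name': 'attempt-1', 'state': 'FAILED'},
        {'name': 'attempt-2', 'state': 'RUNNING'},
    ]
}

# Inspect every entry instead of assuming the list has exactly one.
apps = result.get('yarnApplications', [])
states = [app['state'] for app in apps]
print(states)
```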
We understand that the Dataproc client is only in beta, but since we have a production system running on it, we would appreciate any input and suggestions.
Thanks