I am running a pipeline on DataflowRunner (Google Cloud Dataflow SDK for Python 0.5.5).

The pipeline:

(p
    | 'Read trip from BigQuery' >> beam.io.Read(beam.io.BigQuerySource(query=known_args.input))
    | 'Convert' >> beam.Map(lambda row: (row['HardwareId'],row))
    | 'Group devices' >> beam.GroupByKey()
    | 'Pull way info from mapserver' >> beam.FlatMap(get_osm_way)
    | 'Map way info to dictionary' >> beam.FlatMap(convert_to_dict)
    | 'Save to BQ' >> beam.io.Write(beam.io.BigQuerySink(
            known_args.output,
            schema=schema_string,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
  )

Autoscaling is enabled, and the runner spun up 15 workers.

More detailed code is in another Stack Overflow question of mine.

After around 2 hours of running, it reported:

19:41:19.908
Attempting refresh to obtain initial access_token
 {
 insertId: "jf9yr4g1sv0qku"   
 jsonPayload: {
  message: "Attempting refresh to obtain initial access_token"    
  worker: "beamapp-root-0216221014-5-02161410-29cb-harness-xqx2"    
  logger: "oauth2client.client:client.py:new_request"    
  thread: "110:140052132222720"    
  job: "2017-02-16_14_10_18-17481182243152998182"    
 }
 resource: {…}   
 timestamp: "2017-02-17T00:41:19.908143997Z"   
 severity: "INFO"   
 labels: {…}   
 logName: "projects/fiona-zhao/logs/dataflow.googleapis.com%2Fworker"   
}

It then started continuously reporting "Refreshing due to a 401". One such entry is:

21:45:12.886
Refreshing due to a 401 (attempt 1/2)
 {
 insertId: "zsorfgg1urhvty"   
 jsonPayload: {
  worker: "beamapp-root-0216221014-5-02161410-29cb-harness-xqx2"    
  logger: "oauth2client.client:client.py:new_request"    
  thread: "110:140052273633024"    
  job: "2017-02-16_14_10_18-17481182243152998182"    
  message: "Refreshing due to a 401 (attempt 1/2)"    
 }
 resource: {…}  
 timestamp: "2017-02-17T02:45:12.886137962Z"   
 severity: "INFO"   
 labels: {
  compute.googleapis.com/resource_name: "dataflow-beamapp-root-0216221014-5-02161410-29cb-harness-xqx2"    
  dataflow.googleapis.com/job_id: "2017-02-16_14_10_18-17481182243152998182"    
  dataflow.googleapis.com/job_name: "beamapp-root-0216221014-530646"    
  dataflow.googleapis.com/region: "global"    
  compute.googleapis.com/resource_type: "instance"    
  compute.googleapis.com/resource_id: "2301951363070532306"    
 }
 logName: "projects/fiona-zhao/logs/dataflow.googleapis.com%2Fworker"   
}

What can I do?

foxwendy
Have you had success running this job before? It does seem like it was running for a long time. Is this unusual? – Pablo Feb 21 '17 at 19:52

1 Answer


These log messages are a normal part of execution and do not in themselves indicate errors. I suggest adding more logging to track down hanging external API calls or execution steps.
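One way to add that logging is to wrap each external call so it logs when it starts, when it finishes, and how long it took. The sketch below is only an illustration: `log_call_duration` and `fetch_way_info` are hypothetical names, and `fetch_way_info` is a stand-in for whatever request the pipeline actually makes to the map server.

```python
import functools
import logging
import time


def log_call_duration(fn):
    """Log when fn starts, finishes, or fails, with the elapsed seconds."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        logging.info('Starting %s', fn.__name__)
        try:
            result = fn(*args, **kwargs)
        except Exception:
            # Record the failure and how long the call ran, then re-raise.
            logging.exception('%s failed after %.1f s',
                              fn.__name__, time.time() - start)
            raise
        logging.info('Finished %s in %.1f s',
                     fn.__name__, time.time() - start)
        return result
    return wrapper


@log_call_duration
def fetch_way_info(hardware_id):
    # Hypothetical stand-in for the real map server request.
    return {'HardwareId': hardware_id}
```

A "Starting" entry in the worker logs with no matching "Finished" entry points at the call that hangs. If the wrapped function is a generator, the wrapper only times the creation of the generator, so it is better to wrap the inner function that performs the request.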

Though we cannot comment on specific execution details of particular jobs on this open forum, the Cloud Dataflow team can provide more support on the dataflow-feedback@google.com mailing list.

Charles Chen