I'm running into a problem while testing out Dataflow, running code like this from a Datalab cell:
import apache_beam as beam
# Pipeline options:
options = beam.options.pipeline_options.PipelineOptions()
gcloud_options = options.view_as(beam.options.pipeline_options.GoogleCloudOptions)
gcloud_options.job_name = 'test002'
gcloud_options.project = 'proj'
gcloud_options.staging_location = 'gs://staging'
gcloud_options.temp_location = 'gs://tmp'
# gcloud_options.region = 'europe-west2'
# Worker options:
worker_options = options.view_as(beam.options.pipeline_options.WorkerOptions)
worker_options.disk_size_gb = 30
worker_options.max_num_workers = 10
# Standard options:
options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DataflowRunner'
# options.view_as(beam.options.pipeline_options.StandardOptions).runner = 'DirectRunner'
# Pipeline:
PL = beam.Pipeline(options=options)
query = 'SELECT * FROM [bigquery-public-data:samples.shakespeare] LIMIT 10'
(
PL | 'read' >> beam.io.Read(beam.io.BigQuerySource(project='project', use_standard_sql=False, query=query))
| 'write' >> beam.io.WriteToText('gs://test/test2.txt', num_shards=1)
)
PL.run()
print('Complete')
There have been various successful attempts and a few that have failed. That's fine and understood, but what I don't understand is what I have done to change the SDK version from 2.9.0 to 2.0.0, as shown below. Could anyone point out what I've done, and how to move back up to SDK version 2.9.0, please?
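For what it's worth, the SDK version that Dataflow reports for a job comes from whichever apache_beam package is importable in the kernel that submitted it, so one way to narrow this down is to print `apache_beam.__version__` in the same Datalab cell before calling `PL.run()`, and pin the version with pip if it has drifted. A minimal sketch (the `version_tuple` helper below is my own illustration, not part of Beam):

```python
# In the Datalab kernel, the version Dataflow will report can be checked with:
#   import apache_beam
#   print(apache_beam.__version__)
# and pinned back with (restart the kernel afterwards):
#   !pip install apache-beam[gcp]==2.9.0

def version_tuple(v):
    """Parse a dotted version string like '2.9.0' into (2, 9, 0) so
    versions compare numerically rather than lexicographically."""
    return tuple(int(p) for p in v.split('.'))

# Numeric comparison gets this right where string comparison also happens to,
# but e.g. '2.10.0' vs '2.9.0' would compare wrongly as strings:
print(version_tuple('2.9.0') > version_tuple('2.0.0'))   # newer SDK
print(version_tuple('2.10.0') > version_tuple('2.9.0'))  # string compare would say False
```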