My Beam Dataflow job succeeds locally (with DirectRunner) but fails in the cloud (with DataflowRunner).

The issue is localized to this code snippet:

import datetime

from google.cloud import storage
from google.cloud.storage.blob import Blob

class SomeDoFn(beam.DoFn):
  ...
  def process(self, gcs_blob_path):
    gcs_client = storage.Client()
    bucket = gcs_client.get_bucket(BUCKET_NAME)  # BUCKET_NAME is defined elsewhere
    blob = Blob(gcs_blob_path, bucket)

    # NEXT LINE IS CAUSING ISSUES! (when run remotely)
    url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')

and Dataflow points to the error: "AttributeError: you need a private key to sign credentials.the credentials you are currently using just contains a token."

My Dataflow job uses a service account (the appropriate service_account_email is provided in the PipelineOptions), but I don't see how I could pass that service account's .json credentials file to the Dataflow job. I suspect my job runs successfully locally because I set the environment variable GOOGLE_APPLICATION_CREDENTIALS=<path to local file with service account credentials>, but how do I set it similarly for the remote Dataflow workers? Or maybe there is another solution, if anyone could help.
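For context, the pipeline options are constructed roughly like this (a sketch; the project, region, bucket, and account values are placeholders, not my real configuration):

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values for illustration only.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    service_account_email='my-sa@my-project.iam.gserviceaccount.com',
)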

govordovsky

2 Answers


You will need to provide the service account JSON key to the workers, similarly to what you are doing locally with the env variable GOOGLE_APPLICATION_CREDENTIALS.

To do so, you can follow a few approaches mentioned in the answers to this question, such as passing it using PipelineOptions.

However, keep in mind that the safest way is to store the JSON key in a GCS bucket and fetch the file from there.

The easy but less safe workaround is to get the key, open it, and build a JSON object from its contents in your code to pass along later.
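A minimal sketch of that workaround (the path is a placeholder; from_service_account_info is the google-auth helper for building credentials from a dict):

import json

from google.oauth2 import service_account

# Read the key file locally, at pipeline-construction time.
with open('/path/to/key.json') as f:  # hypothetical local path
    key_dict = json.load(f)

# Build credentials from the in-memory dict instead of a file path.
credentials = service_account.Credentials.from_service_account_info(key_dict)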

Waelmas
  • thanks. I think I understand the two approaches: 1) keep the .json in GCS and 2) create the .json locally and pass it to the workers. However, I'm not sure I got you right when you say "Such as passing it using PipelineOptions". Do you know the concrete option supported by GCP? As I mentioned in the question, I use `service_account_email` but don't see any other relevant option. – govordovsky Jan 02 '20 at 08:29
  • I meant using "temp_location" and "staging_location" to specify the bucket that will hold the JSON key. – Waelmas Jan 02 '20 at 09:00

You can see an example here of how to add custom options to your Beam pipeline. With this, we can create a --key_file argument that points to the credentials stored in GCS:

parser.add_argument('--key_file',
                  dest='key_file',
                  required=True,
                  help='Path to service account credentials JSON.')
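
For completeness, a sketch of the usual Beam pattern that consumes this parser (following the standard wordcount example; `argv` is whatever your entry point receives):

known_args, pipeline_args = parser.parse_known_args(argv)

# PipelineOptions comes from apache_beam.options.pipeline_options.
pipeline_options = PipelineOptions(pipeline_args)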

This will allow you to add the --key_file gs://PATH/TO/CREDENTIALS.json flag when running the job.

Then, you can read it from within the job and pass it as a side input to the DoFn that needs to sign the blob. Starting from the example here, we create a credentials PCollection to hold the JSON file:

credentials = (p 
  | 'Read Credentials from GCS' >> ReadFromText(known_args.key_file))

and we broadcast it to all the workers running SignFileFn:

(p
  | 'Read File from GCS' >> beam.Create([known_args.input])
  | 'Sign File' >> beam.ParDo(SignFileFn(), pvalue.AsList(credentials)))

Inside the ParDo, we build the JSON object to initialize the client (using the approach here) and sign the file:

class SignFileFn(beam.DoFn):
  """Signs GCS file with GCS-stored credentials"""
  def process(self, gcs_blob_path, creds):
    # Imports inside process so they are available on the workers.
    import datetime
    import json
    import logging
    from google.cloud import storage
    from google.oauth2 import service_account

    # The side input arrives as a list of text lines; rebuild the JSON key.
    credentials_json = json.loads('\n'.join(creds))
    credentials = service_account.Credentials.from_service_account_info(credentials_json)

    gcs_client = storage.Client(credentials=credentials)

    # gs://bucket/path/to/file -> the bucket name is element 2 of the split.
    bucket = gcs_client.get_bucket(gcs_blob_path.split('/')[2])
    blob = bucket.blob('/'.join(gcs_blob_path.split('/')[3:]))

    url = blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')
    logging.info(url)
    yield url

See the full code here.

Guillem Xercavins
  • this is the way to do it! thanks for the detailed answer – govordovsky Jan 05 '20 at 18:36
  • maybe one alternative would be to read the credentials locally and pass them to the `SignFileFn` constructor as a string (see the sketch below). Do you know whether one way or the other has any benefits? – govordovsky Jan 05 '20 at 18:38
  • Yes, I also thought about that possibility, which I think would be simpler to implement, but I thought this one would be better for auditing/controlling access (as the key is retrieved using the controller service account instead of by the end user launching the job). It can also be extended so that the side input is refreshed periodically, in case you need to rotate the credentials file. – Guillem Xercavins Jan 12 '20 at 14:38
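
For reference, the constructor-based alternative discussed in these comments could look roughly like the sketch below (the class name is illustrative and the key string would be read locally before launching the job; this is not part of the answer above):

class SignFileFnViaCtor(beam.DoFn):
  """Signs a GCS file with credentials passed in at construction time."""
  def __init__(self, credentials_str):
    # The key contents are pickled with the DoFn and shipped to every worker.
    self.credentials_str = credentials_str

  def process(self, gcs_blob_path):
    import datetime
    import json
    from google.cloud import storage
    from google.oauth2 import service_account

    info = json.loads(self.credentials_str)
    credentials = service_account.Credentials.from_service_account_info(info)
    gcs_client = storage.Client(credentials=credentials)

    # gs://bucket/path/to/file -> the bucket name is element 2 of the split.
    bucket = gcs_client.get_bucket(gcs_blob_path.split('/')[2])
    blob = bucket.blob('/'.join(gcs_blob_path.split('/')[3:]))
    yield blob.generate_signed_url(datetime.timedelta(seconds=300), method='GET')

It would be used as beam.ParDo(SignFileFnViaCtor(key_contents)), with the auditing and key-rotation trade-offs mentioned in the comment above.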