
I want to use a credential JSON file (or string) to authenticate a Beam job to read from a GCS bucket. Notably, the credentials are provided by a user in an existing process, so I'm stuck using the JSON file rather than a service account in my own GCP project.

What I've tried

  • Using fsspec/gcsfs: this seems to work (see the sketch just after this list), but I'm worried about scalability, and I assume Beam's file IO is more optimized to scale and plays nicer with Beam objects
  • Creating my own "storage_client" to pass to apache_beam.io.gcp.gcsio.GcsIO, trying to go off these docs. The storage client seems to be Beam's own internal client:
    • I've tried passing a google.cloud.storage.Client for this, but the client listed in the docs is deprecated, with lots of warnings written in red on the repo, so I'd like to avoid using it.
    • I've tried passing google.auth credentials to it in a few different forms: the raw dict (read by parsing the JSON), credentials from google.auth.load_credentials_from_file, and impersonated credentials.
  • Creating an apache_beam.options.pipeline_options.GoogleCloudOptions to pass to an apache_beam.io.gcp.gcsfilesystem.GCSFileSystem. I found these docs from Beam and these docs from Google, but I haven't been able to piece together how to pass JSON credentials for authentication.
  • I've dabbled in the idea of using Google KMS for passing the creds securely, but (as far as I can tell) it doesn't solve my problem of getting Beam to use the credentials.
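
For concreteness, the fsspec/gcsfs attempt from the first bullet looks roughly like this (just a sketch; the paths are placeholders, and `token` also accepts a parsed dict or an already-built google.auth credentials object):

import gcsfs

# token can point straight at the user-supplied credentials JSON file
fs = gcsfs.GCSFileSystem(token="/path/to/file.json")
with fs.open("gs://path/to/file.txt") as f:
    print(f.read())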

Small-ish Example

It'd be nice to get something like this working:

from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions
import apache_beam as beam
from apache_beam.io.gcp.gcsio import GcsIO
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem

# need to support "type": "service_account", but would be nice to be able to support "type": "authorized_user" too, if possible
cred_file = "/path/to/file.json"

# not sure which of these is preferable for this...
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
    
# voodoo

with beam.Pipeline(options=options) as p:
    gcs = GcsIO()
    gcs.open("gs://path/to/file.txt")

    # or
    gcs = GCSFileSystem(options)
    gcs.open("gs://path/to/file.txt")
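
For reference, reading the credentials in is the part I already have working (a sketch; google.auth handles both "type": "service_account" and "type": "authorized_user" files here); it's getting Beam to use them that I'm missing:

import google.auth

# Returns (credentials, project_id); accepts both service_account
# and authorized_user JSON files.
credentials, project = google.auth.load_credentials_from_file(cred_file)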

Happy to provide any other info that could help, as you can see I've been at this for a while now.

Patrick

1 Answer


You can load service account credentials in your code as below:

from google.oauth2 import service_account

# Build credentials from a service-account JSON key file
key_path = "path/to/service_account.json"
credentials = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
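
One option to have Beam itself pick those credentials up (a sketch, assuming the JSON file is readable wherever the pipeline code runs; workers on a remote runner would need access to it as well) is to point Application Default Credentials at the same file before building the pipeline, since Beam's GCP auth goes through google.auth:

import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# google.auth's Application Default Credentials accept both
# "type": "service_account" and "type": "authorized_user" JSON files.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    lines = p | beam.io.ReadFromText("gs://path/to/file.txt")
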
Prajna Rai T
  • So I know how to read the credentials in, but how do I get Beam to use those credentials when reading in files? – Patrick Mar 16 '23 at 21:26
  • Does this [link](https://beam.apache.org/releases/pydoc/2.4.0/_modules/apache_beam/internal/gcp/auth.html) help you? You can refer to this [document](https://towardsdatascience.com/apache-beam-pipeline-for-cleaning-batch-data-using-cloud-dataflow-and-bigquery-f9272cd89eba) to read data from a GCS bucket using Apache Beam. – Prajna Rai T Mar 17 '23 at 12:14
  • @PrajnaRaiT do you have a similar method when using ADC-generated credentials (so not type `service_account`) by running `gcloud auth application-default login`, which are of type `authorized_user` (refresh token)? – fpopic May 13 '23 at 14:23