I want to use a credentials JSON file (or string) to authenticate a Beam job so it can read from a GCS bucket. Notably, the credentials are provided by a user in an existing process, so I'm stuck using the JSON file rather than a service account in my own GCP project.
What I've tried
- Using fsspec/gcsfs: this seems to work, but I'm worried about scalability, and I assume Beam's file IO is better optimized to scale and plays nicer with Beam objects.
- Creating my own "storage_client" to pass to apache_beam.io.gcp.gcsio.GcsIO, trying to go off these docs. The storage client seems to be Beam's own internal client:
  - I've tried passing a google.cloud.storage.Client for this, and the client listed in the docs is deprecated with lots of things written in red on the repo, so I'd like to avoid using it.
  - I've tried passing google.auth credentials to it in a few different forms: the raw dict (read by parsing the JSON), google.auth.load_credentials_from_file-based creds, and impersonated credentials.
- Creating apache_beam.options.pipeline_options.GoogleCloudOptions to pass to an apache_beam.io.gcp.gcsfilesystem.GCSFileSystem. I found these docs from Beam and these docs from Google, but I haven't been able to piece together how to pass JSON credentials for authentication.
- I've dabbled in the idea of using Google KMS for passing the creds securely, but (as far as I can tell) it doesn't solve my problem of getting Beam to use the credentials.
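For reference, the credential-loading step shared by the attempts above looks roughly like the sketch below. The helper names load_credential_info and build_credentials are hypothetical (made up for illustration), the read-only storage scope is an assumption, and build_credentials needs the google-auth package installed. It only covers turning the user's JSON into a credentials object; it does not solve getting Beam to use it.

```python
import json
from pathlib import Path

# The two credential types mentioned in this question.
SUPPORTED_TYPES = {"service_account", "authorized_user"}


def load_credential_info(source):
    """Return the credential JSON as a dict, given a file path, a JSON string, or a dict."""
    if isinstance(source, dict):
        info = dict(source)
    else:
        text = str(source).strip()
        # A JSON object starts with "{"; anything else is treated as a file path.
        if not text.startswith("{"):
            text = Path(text).read_text(encoding="utf-8")
        info = json.loads(text)
    cred_type = info.get("type")
    if cred_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported credential type: {cred_type!r}")
    return info


def build_credentials(
    info,
    scopes=("https://www.googleapis.com/auth/devstorage.read_only",),  # assumed scope
):
    """Turn the parsed dict into a google-auth credentials object."""
    # Imported lazily so the parsing helper above has no third-party dependency.
    from google.oauth2 import credentials as user_credentials
    from google.oauth2 import service_account

    if info["type"] == "service_account":
        return service_account.Credentials.from_service_account_info(
            info, scopes=list(scopes)
        )
    return user_credentials.Credentials.from_authorized_user_info(
        info, scopes=list(scopes)
    )
```

The object returned by build_credentials is the kind of thing that can be handed to google.cloud.storage.Client(credentials=...); whether GcsIO or GCSFileSystem can be made to accept it is exactly the open question here.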
Small-ish Example
It'd be nice to get something like this working:
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions
import apache_beam as beam
from apache_beam.io.gcp.gcsio import GcsIO
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem
# need to support "type": "service_account", but would be nice to be able to support "type": "authorized_user" too, if possible
cred_file = "/path/to/file.json"
# not sure which of these is preferable for this...
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
# voodoo
with beam.Pipeline(options=options) as p:
    gcs = GcsIO()
    gcs.open("gs://path/to/file.txt")
    # or
    gcs = GCSFileSystem(options)
    gcs.open("gs://path/to/file.txt")
Happy to provide any other info that could help; as you can see, I've been at this for a while now.