I'm submitting a training job to the GCP AI Platform Training service. My training dataset (around 40M rows in a BigQuery table in the same GCP project) needs to be preprocessed as a pandas DataFrame at the beginning of the training job, so I tried both of the solutions proposed by the GCP documentation:
pandas_gbq API:
import pandas as pd
df = pd.read_gbq(query, project_id=PROJECT, dialect='standard', use_bqstorage_api=True)
google-cloud-bigquery API:
client.query(query).to_dataframe(bqstorage_client=bqstorage_client)
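For reference, client and bqstorage_client are built on the job's default application credentials; a minimal sketch of that setup (variable names assumed, the exact construction in my code may differ slightly):

from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

# Both clients pick up the default service account credentials of the environment.
client = bigquery.Client(project=PROJECT)
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()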
Both methods work on an AI Platform Notebooks VM, downloading the whole 40M-row dataset as a pandas DataFrame in a few minutes. I'm struggling to replicate the same procedure on the AI Platform training server (which runs on an n1-highmem-16 machine). With the pandas-gbq API I get a permission-denied error:
google.api_core.exceptions.PermissionDenied: 403 request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/acn-c4-crmdataplatform-dev'
In the case of the google-cloud-bigquery API there are no errors.
Here is the list of required packages that, as suggested by the GCP documentation, I pass to the AI Platform training job via a setup.py file in the trainer package (a sketch of that setup.py follows the list):
- tensorflow==2.1.0
- numpy==1.18.2
- pandas==1.0.3
- google-api-core==1.17.0
- google-cloud-core==1.3.0
- pyarrow==0.16.0
- pandas-gbq==0.13.1
- google-cloud-bigquery-storage==0.8.0
- google-cloud-bigquery==1.24.0
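The setup.py itself is essentially the standard one from the GCP docs; a minimal sketch (package name, version, and description are placeholders):

from setuptools import find_packages, setup

REQUIRED_PACKAGES = [
    'tensorflow==2.1.0',
    'numpy==1.18.2',
    'pandas==1.0.3',
    'google-api-core==1.17.0',
    'google-cloud-core==1.3.0',
    'pyarrow==0.16.0',
    'pandas-gbq==0.13.1',
    'google-cloud-bigquery-storage==0.8.0',
    'google-cloud-bigquery==1.24.0',
]

setup(
    name='trainer',            # placeholder package name
    version='0.1',             # placeholder version
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    description='AI Platform training application.',
)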