
I'm submitting a training job to the GCP AI Platform training service. My training dataset (around 40M rows in a BigQuery table in the same GCP project) needs to be preprocessed as a pandas DataFrame at the beginning of the training job, so I tried both of the solutions proposed by the GCP documentation:

  • pandas-gbq API: pd.read_gbq(query, project_id=PROJECT, dialect='standard', use_bqstorage_api=True)

  • google-cloud-bigquery API: client.query(query).to_dataframe(bqstorage_client=bqstorage_client)

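The google-cloud-bigquery path can be sketched as below, with the Storage API client passed explicitly. The project and helper names are hypothetical placeholders; with google-cloud-bigquery-storage==0.8.0 the client lives in the bigquery_storage_v1beta1 module.

```python
def build_query(table, limit=None):
    """Standard-SQL query for the training table; LIMIT is handy for smoke tests."""
    sql = "SELECT * FROM `{}`".format(table)
    if limit:
        sql += " LIMIT {}".format(limit)
    return sql

def load_training_frame(query, project="my-gcp-project"):
    # Imported lazily so the module can be inspected without the GCP libraries.
    from google.cloud import bigquery
    from google.cloud import bigquery_storage_v1beta1

    client = bigquery.Client(project=project)
    bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()
    # Without bqstorage_client, to_dataframe() falls back to the much
    # slower tabledata.list API, which matters for a 40M-row result.
    return client.query(query).to_dataframe(bqstorage_client=bqstorage_client)
```
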
Both methods work on an AI Platform notebook VM, downloading the whole 40M-row dataset as a pandas DataFrame in a few minutes. I'm struggling to replicate the same procedure on the AI Platform training server (which runs on an n1-highmem-16 machine). With the pandas-gbq API I get a permission-denied error:

google.api_core.exceptions.PermissionDenied: 403 request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/acn-c4-crmdataplatform-dev'

With the google-cloud-bigquery API there are no errors.

Here is the list of required packages that I pass to the AI Platform training job via a setup.py file in the trainer package, as suggested by the GCP documentation:

  • tensorflow==2.1.0
  • numpy==1.18.2
  • pandas==1.0.3
  • google-api-core==1.17.0
  • google-cloud-core==1.3.0
  • pyarrow==0.16.0
  • pandas-gbq==0.13.1
  • google-cloud-bigquery-storage==0.8.0
  • google-cloud-bigquery==1.24.0
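
A minimal setup.py sketch that pins the versions above; the package name and version number are hypothetical placeholders.

```python
from setuptools import find_packages, setup

REQUIRED_PACKAGES = [
    "tensorflow==2.1.0",
    "numpy==1.18.2",
    "pandas==1.0.3",
    "google-api-core==1.17.0",
    "google-cloud-core==1.3.0",
    "pyarrow==0.16.0",
    "pandas-gbq==0.13.1",
    "google-cloud-bigquery-storage==0.8.0",
    "google-cloud-bigquery==1.24.0",
]

setup(
    name="trainer",  # hypothetical package name
    version="0.1",
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
)
```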

1 Answer

You have to do two things:

  • First, check that the service account service-<PROJECT_NUMBER>@cloud-ml.google.com.iam.gserviceaccount.com exists and has the Cloud ML Service Agent role. If not, add the role manually (you don't have to create the account!)
  • Grant this service account the permission to query your BigQuery dataset.
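
The two grants above can be sketched as gcloud commands, built in Python so the service-agent address is assembled correctly. PROJECT_ID and PROJECT_NUMBER are placeholders, and the exact BigQuery roles are assumptions about a minimal grant: roles/bigquery.readSessionUser carries bigquery.readsessions.create, while roles/bigquery.dataViewer covers reading the dataset itself.

```python
def service_agent_email(project_number):
    # Google creates this account automatically; you only grant it roles.
    return "service-{}@cloud-ml.google.com.iam.gserviceaccount.com".format(
        project_number
    )

def grant_commands(project_id, project_number):
    member = "serviceAccount:" + service_agent_email(project_number)
    roles = [
        "roles/ml.serviceAgent",           # Cloud ML Service Agent
        "roles/bigquery.dataViewer",       # read the dataset
        "roles/bigquery.readSessionUser",  # create BQ Storage read sessions
    ]
    return [
        "gcloud projects add-iam-policy-binding {} --member={} --role={}".format(
            project_id, member, role
        )
        for role in roles
    ]
```
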
guillaume blaquiere