The Google Cloud documentation says that the following environment variables are passed to the training container:
AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values include: jsonl, csv, or bigquery.
AIP_TRAINING_DATA_URI: The location where your training data is stored.
AIP_VALIDATION_DATA_URI: The location where your validation data is stored.
AIP_TEST_DATA_URI: The location where your test data is stored.
Each of the URI values is a wildcard pattern that matches the training, validation, and test data files in .jsonl
format, like so:
gs://bucket_name/path/training-*
gs://bucket_name/path/validation-*
gs://bucket_name/path/test-*
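For reference, these variables can be inspected from inside the container with nothing beyond the standard library (a minimal sketch; the printed values are whatever Vertex AI injects for the job):

import os

# Print the data-split variables Vertex AI sets in the container
for name in ('AIP_DATA_FORMAT', 'AIP_TRAINING_DATA_URI',
             'AIP_VALIDATION_DATA_URI', 'AIP_TEST_DATA_URI'):
    print(name, '=', os.environ.get(name))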
Now, inside your custom container running the Python code, how do you actually access the contents of each of these files?
I've tried splitting the URI string with a regex to obtain the bucket_name and the prefix,
and then attempted to fetch the files using bucket.list_blobs(delimiter='/', prefix=prefix[:-1]),
but it comes back empty even though the files are definitely there. Here is a minimal example of the attempted code:
import os
import re
from google.cloud import storage
aip_training_data_uri = os.environ.get('AIP_TRAINING_DATA_URI')
match = re.match('gs://(.*?)/(.*)', aip_training_data_uri)
bucket_name, prefix = match.groups()
client = storage.Client()
bucket = client.bucket(bucket_name)
blobs = bucket.list_blobs(delimiter='/', prefix=prefix[:-1])  # "[:-1]" strips the trailing "*" wildcard
for blob in blobs:
    print(blob.download_as_string())  # This returns an empty string
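For anyone reproducing this, here is a stripped-down diagnostic (a sketch, reusing the client, bucket_name, and prefix from above): it drops the delimiter argument and prints only object names and sizes, to check whether the client can see the exported shards at all.

# Diagnostic sketch: list every object under the prefix, no delimiter,
# printing name and size rather than downloading the contents.
for blob in client.list_blobs(bucket_name, prefix=prefix[:-1]):
    print(blob.name, blob.size)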