The Google Cloud documentation says that the following environment variables are passed to the training container:
AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values include: jsonl, csv, or bigquery.
AIP_TRAINING_DATA_URI: The location where your training data is stored.
AIP_VALIDATION_DATA_URI: The location where your validation data is stored.
AIP_TEST_DATA_URI: The location where your test data is stored.
Each of the URI values is a wildcard pattern that matches the training, validation, and test data files in .jsonl
format, like so:
gs://bucket_name/path/training-*
gs://bucket_name/path/validation-*
gs://bucket_name/path/test-*
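For reference, these variables can be inspected from inside the container with nothing beyond the standard library (a minimal sketch; the printed values are whatever Vertex AI injects for the job):

import os

# Print the data-split variables Vertex AI sets in the container
for name in ('AIP_DATA_FORMAT', 'AIP_TRAINING_DATA_URI',
             'AIP_VALIDATION_DATA_URI', 'AIP_TEST_DATA_URI'):
    print(name, '=', os.environ.get(name))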
Now, inside your custom container running the Python code, how do you actually access the contents of each of these files?
I've tried splitting the URI string with a regex to obtain the bucket_name and the prefix,
and then attempted to fetch the files using bucket.list_blobs(delimiter='/', prefix=prefix[:-1]),
but it comes back empty even though the files are definitely there. Here is a minimal example of the attempted code:
import os
import re
from google.cloud import storage
aip_training_data_uri = os.environ.get('AIP_TRAINING_DATA_URI')
match = re.match('gs://(.*?)/(.*)', aip_training_data_uri)
bucket_name, prefix = match.groups()
client = storage.Client()
bucket = client.bucket(bucket_name)
blobs = bucket.list_blobs(delimiter='/', prefix=prefix[:-1])  # "[:-1]" strips the trailing "*" wildcard
for blob in blobs:
    print(blob.download_as_string())  # This returns an empty string
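For anyone reproducing this, here is a stripped-down diagnostic (a sketch, reusing the client, bucket_name, and prefix from above): it drops the delimiter argument and prints only object names and sizes, to check whether the client can see the exported shards at all.

# Diagnostic sketch: list every object under the prefix, no delimiter,
# printing name and size rather than downloading the contents.
for blob in client.list_blobs(bucket_name, prefix=prefix[:-1]):
    print(blob.name, blob.size)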