3

I am trying to create a Python utility that takes a dataset from Vertex AI Datasets and generates statistics for it. However, I am unable to access the dataset from a Jupyter notebook. Is there any way to do this?

  • So you want to call a Vertex AI dataset from a Jupyter notebook. What is that dataset, is it a text one? What statistics would you like to get? Number of words? Please elaborate. – PjoterS Sep 01 '21 at 10:08
  • Yes, I want to get statistics for both text and tabular data. For tabular data, stats like std_dev and count-based features; for text data, sentence length, word count, and character count. Is there any way to do that? – Ajinkya Mishrikotkar Sep 01 '21 at 10:25
  • Could you elaborate? The only things that come to my mind are [Vertex Pipelines: Metrics visualization and run comparison using the KFP SDK](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/pipelines/metrics_viz_run_compare_kfp.ipynb) and [Vertex Pipelines: model train, upload, and deploy using google-cloud-pipeline-components](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/pipelines/google_cloud_pipeline_components_model_train_upload_deploy.ipynb), but I am not sure if that is what you are asking. Some scenario/use-case? – PjoterS Sep 01 '21 at 14:33

2 Answers

0

If I understand correctly, you want to use a Vertex AI dataset inside a Jupyter notebook. I don't think this is currently possible. You can, however, export Vertex AI datasets to Google Cloud Storage in JSONL format:

Your dataset will be exported as a list of text items in JSONL format. Each row contains a Cloud Storage path, any label(s) assigned to that item, and a flag that indicates whether that item is in the training, validation, or test set.
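As a rough sketch of that route (the project, region, dataset ID, and bucket below are placeholders, and the export_data call should be verified against the current google-cloud-aiplatform SDK):

from google.cloud import aiplatform
import pandas as pd

# Placeholders: replace with your project, region, dataset resource name and bucket.
aiplatform.init(project="my-project", location="us-central1")
dataset = aiplatform.TextDataset(
    "projects/my-project/locations/us-central1/datasets/1234567890")

# export_data writes JSONL file(s) under the given GCS directory
# and returns the list of exported file paths.
exported_files = dataset.export_data(output_dir="gs://my-bucket/vertex-export/")

# Each JSONL row references a text item with its labels; reading gs:// paths
# with pandas requires the gcsfs package.
df = pd.read_json(exported_files[0], lines=True)
print(df.head())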

At this moment, you can use BigQuery data inside a notebook with the %%bigquery magic, as mentioned in Visualizing BigQuery data in a Jupyter notebook, or read a CSV file from the machine's directory or GCS with pandas.read_csv(), as shown in the How to read csv file in Google Cloud Platform jupyter notebook thread.
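For example, each snippet below goes in its own notebook cell (%%bigquery must be the first line of its cell), and the project, table, and bucket names are placeholders:

%load_ext google.cloud.bigquery

%%bigquery df
SELECT * FROM `my-project.my_dataset.my_table` LIMIT 1000

import pandas as pd
df = pd.read_csv("gs://my-bucket/data.csv")  # gs:// paths need gcsfs installed
df.describe()  # count, mean, std, min, quartiles, max for numeric columns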

However, you can file a Feature Request in Google Issue Tracker to add the possibility of using Vertex AI datasets directly in Jupyter notebooks, which will be considered by the Google Vertex AI team.

PjoterS
0

Please correct me if I am wrong: are you trying to access a Vertex AI dataset in your GCP project from a Jupyter notebook? If so, try the code below and see if you can access the dataset (note that it uses the older AutoML Tables client rather than the Vertex AI SDK).

def list_datasets(project_id, compute_region, filter=None):
    """List all datasets."""
    # TODO(developer): Uncomment and set the following variables
    # project_id = 'PROJECT_ID_HERE'
    # compute_region = 'COMPUTE_REGION_HERE'
    # filter = 'filter expression here'

    from google.cloud import automl_v1beta1 as automl

    client = automl.TablesClient(project=project_id, region=compute_region)
    print("client:", client)

    # List all the datasets available in the region by applying the filter.
    response = client.list_datasets(filter=filter)

    result = []
    print("List of datasets:")
    for dataset in response:
        # Display the dataset information.
        print("Dataset name: {}".format(dataset.name))
        print("Dataset id: {}".format(dataset.name.split("/")[-1]))
        print("Dataset display name: {}".format(dataset.display_name))
        metadata = dataset.tables_dataset_metadata
        print("Dataset primary table spec id: {}".format(
            metadata.primary_table_spec_id))
        print("Dataset target column spec id: {}".format(
            metadata.target_column_spec_id))
        print("Dataset weight column spec id: {}".format(
            metadata.weight_column_spec_id))
        print("Dataset ml use column spec id: {}".format(
            metadata.ml_use_column_spec_id))
        print("Dataset example count: {}".format(dataset.example_count))
        print("Dataset create time: {}".format(dataset.create_time))
        print("\n")

        result.append(dataset)

    return result

You need to pass project_id and compute_region when calling this function.
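For example, a hypothetical call (the project ID and region are placeholders):

datasets = list_datasets(project_id="my-project", compute_region="us-central1")
print("Found {} dataset(s)".format(len(datasets)))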

tt0206