I am trying to create a Python utility that takes a dataset from Vertex AI Datasets and generates statistics for it. But I am unable to access the dataset from a Jupyter notebook. Is there any way around this?
-
So you want to call a Vertex AI dataset from Jupyter Notebooks. What is that dataset, is it a text one? What statistics would you like to get? Number of words? Please elaborate. – PjoterS Sep 01 '21 at 10:08
-
Yes, I want to get statistics for both text and tabular data. Stats like std_dev and count-based features for tabular data, and for text data the length of sentences, word count, and character count. Is there any way to do that? – Ajinkya Mishrikotkar Sep 01 '21 at 10:25
-
Could you elaborate? The only things that come to my mind are [Vertex Pipelines: Metrics visualization and run comparison using the KFP SDK](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/pipelines/metrics_viz_run_compare_kfp.ipynb) and [Vertex Pipelines: model train, upload, and deploy using google-cloud-pipeline-components](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/pipelines/google_cloud_pipeline_components_model_train_upload_deploy.ipynb), but I am not sure if that is what you are asking. Some scenario/use-case? – PjoterS Sep 01 '21 at 14:33
2 Answers
If I understand correctly, you want to use a Vertex AI dataset inside a Jupyter Notebook. I don't think that this is currently possible. You are able to export Vertex AI datasets to Google Cloud Storage in JSONL format:

> Your dataset will be exported as a list of text items in JSONL format. Each row contains a Cloud Storage path, any label(s) assigned to that item, and a flag that indicates whether that item is in the training, validation, or test set.
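Once you have such an export, computing the text statistics you mentioned (word count, character count) is straightforward in a notebook. A minimal sketch, assuming the JSONL rows carry an inline `textContent` field and a classification annotation — the actual schema depends on your dataset type, so treat these field names as placeholders:

```python
import json

# Hypothetical JSONL lines as they might appear after export; the real schema
# depends on the dataset type (here we assume an inline "textContent" field).
exported = [
    '{"textContent": "Vertex AI makes training easy.", "classificationAnnotation": {"displayName": "positive"}}',
    '{"textContent": "The export finished without errors.", "classificationAnnotation": {"displayName": "neutral"}}',
]

def text_stats(jsonl_lines):
    """Compute simple per-item text statistics from exported JSONL lines."""
    stats = []
    for line in jsonl_lines:
        item = json.loads(line)
        text = item["textContent"]
        stats.append({
            "char_count": len(text),
            "word_count": len(text.split()),
            "label": item["classificationAnnotation"]["displayName"],
        })
    return stats

for s in text_stats(exported):
    print(s)
```

If the export contains Cloud Storage paths instead of inline text, you would first download each referenced file and then apply the same per-item counting.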
At the moment, you can use BigQuery data inside a notebook with the `%%bigquery` magic, as mentioned in Visualizing BigQuery data in a Jupyter notebook, or read a CSV file with `read_csv()` from a local directory or GCS, as shown in the How to read csv file in Google Cloud Platform jupyter notebook thread.
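For the tabular statistics (std_dev, counts), the CSV route pairs naturally with pandas. A self-contained sketch — in a real notebook you would replace the inline CSV with a `gs://` path (which requires the `gcsfs` package, and the bucket name below is a placeholder):

```python
import io
import pandas as pd

# In a notebook you would read straight from GCS, e.g.:
#   df = pd.read_csv("gs://your-bucket/exported_dataset.csv")
# Here we use an inline CSV so the example is self-contained.
csv_data = io.StringIO(
    "age,income\n"
    "34,52000\n"
    "41,61000\n"
    "29,48000\n"
)
df = pd.read_csv(csv_data)

# count, mean, std, min/max and quartiles for every numeric column
print(df.describe())
print("std dev of income:", df["income"].std())
```

`DataFrame.describe()` already covers most of the count- and deviation-style features you listed, so often no custom code is needed for the tabular side.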
However, you can file a Feature Request in Google Issue Tracker to add the possibility to use a Vertex AI dataset directly in a Jupyter Notebook, which will be considered by the Google Vertex AI team.

Please correct me if I am wrong: are you trying to access a Vertex AI dataset that is in your GCP project from the Jupyter notebook? If so, try the code below and see if you can access the dataset.
def list_datasets(project_id, compute_region, filter=None):
    """List all datasets."""
    result = []
    # [START automl_tables_list_datasets]
    # TODO(developer): Uncomment and set the following variables
    # project_id = 'PROJECT_ID_HERE'
    # compute_region = 'COMPUTE_REGION_HERE'
    # filter = 'filter expression here'
    from google.cloud import automl_v1beta1 as automl

    client = automl.TablesClient(project=project_id, region=compute_region)
    print("client:", client)

    # List all the datasets available in the region by applying the filter.
    response = client.list_datasets(filter=filter)

    print("List of datasets:")
    for dataset in response:
        # Display the dataset information.
        print("Dataset name: {}".format(dataset.name))
        print("Dataset id: {}".format(dataset.name.split("/")[-1]))
        print("Dataset display name: {}".format(dataset.display_name))
        metadata = dataset.tables_dataset_metadata
        print(
            "Dataset primary table spec id: {}".format(
                metadata.primary_table_spec_id
            )
        )
        print(
            "Dataset target column spec id: {}".format(
                metadata.target_column_spec_id
            )
        )
        print(
            "Dataset weight column spec id: {}".format(
                metadata.weight_column_spec_id
            )
        )
        print(
            "Dataset ml use column spec id: {}".format(
                metadata.ml_use_column_spec_id
            )
        )
        print("Dataset example count: {}".format(dataset.example_count))
        print("Dataset create time: {}".format(dataset.create_time))
        print("\n")
        # [END automl_tables_list_datasets]
        result.append(dataset)
    return result
You need to pass project_id and compute_region when calling this function.
