I have my machine learning datasets in DVC. It's relatively simple to version the dataset with DVC + git.

Now that all of the training and deployment have moved to Vertex AI, I'm trying to version my datasets there as well.

My dataset changes a lot: each month I have to grab new features from production, which becomes a new version of the dataset, and sometimes new features are added on top of that.

At the moment I'm uploading the datasets manually through the UI, but I can't find any option to change / update a dataset with a new version.

  • Zabir, could you please provide a bit more detail? What kind of data are we talking about (tabular?), and could you share a link to the UI that you mentioned (docs)? – Shcheklein Oct 02 '22 at 02:50
  • @Shcheklein it's a tabular dataset, actually; with DVC it's stored (GCS bucket) in Parquet format. I convert the Parquet file to CSV and upload it to Vertex AI using the UI. By "UI", I just mean the Vertex AI web page - [ui screenshot](https://miro.medium.com/max/4800/0*TmL8Obg2rTIVsoPJ) – Zabir Al Nazi Oct 02 '22 at 07:03

1 Answer


There's currently no option for versioning datasets in Vertex AI. If the underlying data stays the same, you could potentially export the annotation set (the first option in the snowman menu in the top-right corner when viewing the annotation set), re-import it, and manually track / identify the versions. That's a bit hacky and definitely not an optimal user experience, but it could work.
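If you'd rather script that export than click through the console, the Vertex AI Python SDK has an `export_data` call; a minimal sketch, assuming an image dataset (annotation sets apply to image/text/video datasets, not tabular ones) with a placeholder project, dataset ID, and bucket:

```python
from google.cloud import aiplatform

# Placeholder project / region / dataset ID - substitute your own.
aiplatform.init(project="my-project", location="us-central1")
dataset = aiplatform.ImageDataset("1234567890123456789")

# Export the annotations to a GCS prefix that encodes the version;
# the versioning scheme itself (v2022-10 here) is something you
# would have to define and track manually.
exported = dataset.export_data("gs://my-bucket/exports/v2022-10/")
print(exported)  # list of exported file URIs
```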

If you need to version the entire dataset, there's currently no good way to do that short of manually managing / naming / tracking the versions yourself.
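For that manual route, one workable convention (a sketch only; the project, bucket, and naming scheme here are made up) is to create one Vertex AI dataset per version and encode the version in both the display name and the labels:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

version = "2022-10"  # e.g. one version per monthly feature pull

# Each "version" is simply a separate dataset; the display name and
# labels are the only things tying it to the versioning scheme.
dataset = aiplatform.TabularDataset.create(
    display_name=f"prod-features-v{version}",
    gcs_source=f"gs://my-bucket/features/{version}/data.csv",
    labels={"version": version},
)
print(dataset.resource_name)
```

Listing datasets filtered on that label then gives you a crude version history.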

Depending on your use case, BigQuery might work as a source for this, with the data then imported into Vertex AI datasets from there. This might help: https://christianlauer90.medium.com/how-to-realize-data-versioning-in-google-bigquery-fb5044a0691f
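A rough sketch of that combination, with placeholder project and table names: a BigQuery table snapshot gives you an immutable point-in-time copy, and each snapshot can then back its own Vertex AI dataset via `bq_source`.

```python
from google.cloud import aiplatform, bigquery

bq = bigquery.Client(project="my-project")  # placeholder project

# Freeze a point-in-time, read-only copy of the source table.
bq.query(
    """
    CREATE SNAPSHOT TABLE `my-project.features.monthly_v2022_10`
    CLONE `my-project.features.monthly`
    """
).result()

# Point a Vertex AI tabular dataset at the frozen snapshot
# (assumes Vertex can read the snapshot like a normal table).
aiplatform.init(project="my-project", location="us-central1")
dataset = aiplatform.TabularDataset.create(
    display_name="prod-features-v2022-10",
    bq_source="bq://my-project.features.monthly_v2022_10",
)
```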