We have a computer vision project. The raw data is stored in S3, and the labeling team sends a new increment of labeled data every day. We want to automate the training process on this new data. We use DVC for reproducing pipelines, MLflow for logging and deploying models, and Airflow for scheduling runs in K8s. We can also create a new branch, modify the model params or architecture, and trigger the training pipeline manually in GitLab CI; that pipeline does the same thing as the Airflow task.
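For context, both the scheduled Airflow task and the manual GitLab CI job boil down to roughly the same command sequence (a hedged sketch; the branch variable and the push step are assumptions about how we would wire it, not something already fixed):

```
git checkout "$EXPERIMENT_BRANCH"   # branch with modified params or architecture
dvc pull                            # fetch labeled data and cached artifacts from the S3 remote
dvc repro                           # re-run only the stages whose dependencies changed
dvc push                            # push new artifacts (model, metrics) back to the remote
```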
We want to check out the raw data that has already been labeled onto a PV, to avoid pulling the huge dataset from S3 on every run. Each run of the DVC pipeline would pull the new labeled data and the corresponding raw data from S3, run preprocessing, train the model, and calculate metrics. In DVC we would version the pipeline code, the labeled data, and the model params. But in this setup we don't version the raw and preprocessed data, which means only one pipeline can run at a time.
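The pipeline we have in mind looks roughly like this (a minimal sketch: stage names, paths, and params are illustrative, and it assumes a params.yaml with a train section):

```
# Preprocessing stage: depends on labels + raw data, produces preprocessed data
dvc stage add -n preprocess \
    -d src/preprocess.py -d data/labels -d data/raw \
    -o data/preprocessed \
    python src/preprocess.py

# Training stage: depends on preprocessed data, tracks params and metrics
dvc stage add -n train \
    -d src/train.py -d data/preprocessed \
    -p train.lr,train.epochs \
    -o models/model.pt \
    -M metrics.json \
    python src/train.py
```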
Alternatively, we can version the raw and preprocessed data and use a shared cache in DVC, but then we get a lot of replicas both in the cache and in the working area, because whenever we want to add new labeled data we have to run dvc unprotect raw_data, which copies the cached data into our local workspace (the PV in K8s).
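Concretely, the shared-cache variant we are considering would be configured roughly like this (a sketch under the assumption of a single PV-backed cache directory; the paths are placeholders):

```
dvc cache dir /mnt/pv/dvc-cache           # one cache directory shared by all pipeline runs
dvc config cache.shared group             # make the cache group-writable for concurrent jobs
dvc config cache.type "hardlink,symlink"  # link from the cache into the workspace instead of copying

# Linked files are read-only, so appending new labeled data to the tracked
# directory first requires unprotecting it, which materializes a real copy:
dvc unprotect raw_data
```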
How can we track the integrity of the raw data, keep the ability to run several experiments at the same time, and avoid producing lots of copies of the data? What is the optimal way to store the data on a PV in K8s? Should we use a shared cache?