We have a computer vision project. The raw data is stored in S3, and the labeling team sends a new increment of labeled data every day. We want to automate the training process on this new data. We use DVC for reproducing pipelines, MLflow for logging and deploying models, and Airflow for scheduling runs in K8s. We can also create a new branch, modify the model params or architecture, and trigger the training pipeline manually in GitLab CI; that pipeline does the same thing as the Airflow task.
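For context, both the scheduled Airflow task and the manual GitLab CI job boil down to roughly the same command sequence (a hedged sketch; the branch variable and the push step are assumptions about how we would wire it, not something already fixed):

```
git checkout "$EXPERIMENT_BRANCH"   # branch with modified params or architecture
dvc pull                            # fetch labeled data and cached artifacts from the S3 remote
dvc repro                           # re-run only the stages whose dependencies changed
dvc push                            # push new artifacts (model, metrics) back to the remote
```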
We want to check out the raw data that has already been labeled onto a PV, to avoid pulling the huge dataset from S3 on every run. Each run of the DVC pipeline would pull the new labeled data and the corresponding raw data from S3, run preprocessing, train the model, and calculate metrics. In DVC we would version the pipeline code, the labeled data, and the model params. But in this setup we don't version the raw and preprocessed data, which means only one pipeline can run at a time.
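The pipeline we have in mind looks roughly like this (a minimal sketch: stage names, paths, and params are illustrative, and it assumes a params.yaml with a train section):

```
# Preprocessing stage: depends on labels + raw data, produces preprocessed data
dvc stage add -n preprocess \
    -d src/preprocess.py -d data/labels -d data/raw \
    -o data/preprocessed \
    python src/preprocess.py

# Training stage: depends on preprocessed data, tracks params and metrics
dvc stage add -n train \
    -d src/train.py -d data/preprocessed \
    -p train.lr,train.epochs \
    -o models/model.pt \
    -M metrics.json \
    python src/train.py
```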
Alternatively, we can version the raw and preprocessed data and use a shared cache in DVC, but then we get a lot of replicas both in the cache and in the working area, because whenever we want to add new labeled data we have to run dvc unprotect raw_data, which copies the cached data into our local workspace (the PV in K8s).
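Concretely, the shared-cache variant we are considering would be configured roughly like this (a sketch under the assumption of a single PV-backed cache directory; the paths are placeholders):

```
dvc cache dir /mnt/pv/dvc-cache           # one cache directory shared by all pipeline runs
dvc config cache.shared group             # make the cache group-writable for concurrent jobs
dvc config cache.type "hardlink,symlink"  # link from the cache into the workspace instead of copying

# Linked files are read-only, so appending new labeled data to the tracked
# directory first requires unprotecting it, which materializes a real copy:
dvc unprotect raw_data
```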
How can we track the integrity of the raw data, keep the ability to run several experiments at the same time, and avoid producing lots of copies of the data? What is the optimal way to store the data on a PV in K8s? Should we use a shared cache?