Hello, I am training a YOLO model in a Kubeflow pipeline. To do this, I have a dataset of pictures that is more than 1 GB.
Currently, I download all the images from MinIO into the container with a script, and after that I train the model.
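Roughly, my current download step looks like the sketch below (the bucket, prefix, endpoint, and credentials are placeholders, and the `minio` client import is deferred inside the function so the path helper works on its own):

```python
import os

def local_path(object_name, dest_dir):
    """Map a MinIO object name, e.g. "yolo/images/a.jpg", onto dest_dir."""
    return os.path.join(dest_dir, *object_name.split("/"))

def download_dataset(bucket, prefix, dest_dir,
                     endpoint="minio-service:9000",   # placeholder service address
                     access_key="minio", secret_key="minio123"):
    """Fetch every object under `prefix` from MinIO into `dest_dir`."""
    from minio import Minio  # requires the `minio` package
    client = Minio(endpoint, access_key=access_key,
                   secret_key=secret_key, secure=False)
    count = 0
    for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
        target = local_path(obj.object_name, dest_dir)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        client.fget_object(bucket, obj.object_name, target)
        count += 1
    return count
```

So every training run re-downloads the whole prefix from scratch.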
I am not sure if there is any best practice for this, because downloading 1 GB for each training run is a lot.
Is there another way to do this that avoids writing a MinIO script to download the picture dataset? Can I use a shared volume or something like that to share files between operators? (The idea is to train another model with the same dataset.)
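Something like this is what I have in mind, sketched with the KFP v1 SDK's `VolumeOp` (the image names, PVC name, size, and paths are all placeholders, not a working setup): download once into a PVC, then have any number of training steps mount the same volume.

```python
from kfp import dsl

@dsl.pipeline(name="yolo-training",
              description="Train YOLO reusing a shared dataset volume")
def yolo_pipeline():
    # Create a PVC that outlives individual steps of the pipeline.
    vop = dsl.VolumeOp(
        name="dataset-volume",
        resource_name="yolo-dataset-pvc",  # placeholder PVC name
        size="5Gi",
        modes=dsl.VOLUME_MODE_RWM,         # ReadWriteMany so several ops can mount it
    )

    # Step 1: download from MinIO once, into the shared volume.
    download = dsl.ContainerOp(
        name="download-dataset",
        image="minio/mc",                  # placeholder image and command
        command=["sh", "-c",
                 "mc cp --recursive myminio/datasets/yolo /data/"],
    ).add_pvolumes({"/data": vop.volume})

    # Step 2: train, mounting the same volume; depending on
    # download.pvolume makes this step run after the download.
    train = dsl.ContainerOp(
        name="train-yolo",
        image="my-yolo-trainer:latest",    # placeholder training image
        command=["python", "train.py", "--data", "/data/yolo"],
    ).add_pvolumes({"/data": download.pvolume})
```

A second training op could mount `download.pvolume` the same way to reuse the dataset without downloading it again.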