
I am studying Kubeflow Pipelines and how the different components of a pipeline are linked to each other. For this, I am using the MNIST example available in the official GitHub repository. However, I am not able to understand the difference between vop.volume and mnist_training_container.pvolume in the code snippet below. From the dsl.VolumeOp and add_volume documentation I assume that vop.volume is a Kubernetes volume, but I am unclear about what pvolume is, why it is linked to the training container, and what the difference between the two is.

vop = dsl.VolumeOp(
    name="create_volume",
    resource_name="data-volume",
    size="500Mi",
    modes=dsl.VOLUME_MODE_RWM)

# Create MNIST training component.
# train_op comes from func_to_container_op, which returns a kfp.dsl.ContainerOp.
# We mount a Kubernetes volume on this container using add_pvolumes.
mnist_training_container = train_op(data_path, model_file) \
                                .add_pvolumes({data_path: vop.volume})

# Create MNIST prediction component.
mnist_predict_container = predict_op(data_path, model_file, image_number) \
                                .add_pvolumes({data_path: mnist_training_container.pvolume})

1 Answer


pvolume is a somewhat odd concept that is a bit alien to KFP. The idea was that a volume is "passed" between components similarly to normal outputs, when in fact it is the same volume the whole time.
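
In concrete terms, mnist_training_container.pvolume refers to the same underlying volume as vop.volume, but wrapped as a PipelineVolume that also records an execution dependency on the step it came from. A rough sketch of the equivalence, reusing the names from the question (KFP v1 DSL assumed):

# Using pvolume: mounts the volume and implicitly runs predict after train.
mnist_predict_container = predict_op(data_path, model_file, image_number) \
                                .add_pvolumes({data_path: mnist_training_container.pvolume})

# Roughly equivalent: mount the original volume and order the steps explicitly.
mnist_predict_container = predict_op(data_path, model_file, image_number) \
                                .add_pvolumes({data_path: vop.volume}) \
                                .after(mnist_training_container)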

We advise our users to avoid the pvolume feature and to avoid using volumes in components altogether. Otherwise, the components and pipelines are not portable and have limited usability.

Please check out the samples, tutorials and components. Almost no pipelines use volumes.

Please check the following two tutorials for Python and shell components, and note what the pipelines usually look like. See, for example, the XGBoost training pipeline.
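
For contrast, here is a minimal sketch of the recommended volume-free style using KFP v1's native data passing; the function name and body are illustrative assumptions, not code from the MNIST sample:

from kfp.components import create_component_from_func, InputPath, OutputPath

def train(data_path: InputPath(), model_path: OutputPath()):
    # Read training data from data_path (delivered by KFP from the upstream
    # step's output) and write the trained model to model_path.
    ...

train_component = create_component_from_func(train)

The data_path and model_path arguments are local paths that KFP wires up between steps, so no volume needs to be created or mounted.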

Ark-kun
  • And if you don't use any of those options, how can you train with several files, e.g. a CNN with a lot of images that you need to download, augment, etc. before launching fit? – Tlaloc-ES Jul 30 '22 at 13:46
  • Use native data passing to pass a directory of images. At some point in the pipeline you might decide to convert that set of images into a binary dataset format like TFRecord or Arrow Feather. Structure the pipeline as follows: Download->Augment->...->Train (see the sketch after these comments). – Ark-kun Jul 31 '22 at 07:39
  • See some of my pipelines here: https://github.com/Ark-kun/pipeline_components/blob/6e9e4df/samples/Basic_ML_training/Train_tabular_regression_model_using_TensorFlow/pipeline.py – Ark-kun Jul 31 '22 at 07:40
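
To make the Download->Augment->...->Train structure from the comments concrete, here is a minimal sketch using native data passing. All function names and bodies are hypothetical illustrations under the KFP v1 SDK, not code from the linked repository:

from kfp import dsl
from kfp.components import create_component_from_func, InputPath, OutputPath

def download_images(images_path: OutputPath()):
    # Hypothetical step: fetch the raw images and write them under images_path.
    import os
    os.makedirs(images_path, exist_ok=True)

def augment_images(images_path: InputPath(), augmented_path: OutputPath()):
    # Hypothetical step: copy the images and apply augmentations to the copy.
    import shutil
    shutil.copytree(images_path, augmented_path)

def train_model(dataset_path: InputPath(), model_path: OutputPath()):
    # Hypothetical step: train on dataset_path and write the model to model_path.
    with open(model_path, "w") as f:
        f.write("trained-model-placeholder")

download_op = create_component_from_func(download_images)
augment_op = create_component_from_func(augment_images)
training_op = create_component_from_func(train_model)

@dsl.pipeline(name="download-augment-train")
def image_pipeline():
    downloaded = download_op()
    augmented = augment_op(images=downloaded.outputs["images"])
    training_op(dataset=augmented.outputs["augmented"])

Each step exchanges data through KFP-managed paths, so no volume (and no pvolume) is needed.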