Hello, I am trying to share files between steps, and in order to do this I have the following code:

from kubernetes import client as k8s_client

VOLUME_NAME_PATH = 'pictures'
VOLUME_PATH = f'/{VOLUME_NAME_PATH}'
V1_VOLUME = k8s_client.V1Volume(name=VOLUME_NAME_PATH)
V1_VOLUME_MOUNT = k8s_client.V1VolumeMount(
    mount_path=VOLUME_PATH,
    name=VOLUME_NAME_PATH
)

def pictures_pipeline():
    download_images_op_step = download_images_op(volume_path=VOLUME_PATH) \
        .add_volume(V1_VOLUME) \
        .add_volume_mount(V1_VOLUME_MOUNT)
    compress_images_op_step = compress_images_op(volume_path=VOLUME_PATH) \
        .add_volume(V1_VOLUME) \
        .add_volume_mount(V1_VOLUME_MOUNT)

    compress_images_op_step.after(download_images_op_step)

As you can see, I am creating a V1_VOLUME and mounting the same volume in all the steps of the pipeline.

The first step, download_images_op_step, downloads the pictures and saves them in the volume, but when the second step starts, the volume is empty.

So how can I persist the data from one step to another?

Thanks

Tlaloc-ES
  • I think you need to include the code of the `download_images_op` and `compress_images_op` in your question. – Ark-kun Aug 07 '22 at 00:22

1 Answer

Please check my answer to a similar question about volumes: https://stackoverflow.com/a/67898164/1497385

The short answer is that using volumes is not a supported way of passing data between components in KFP. I'm not saying it cannot work, but if a developer goes outside the officially supported data-passing method, they're on their own.

Using KFP without KFP's data passing is pretty close to not using KFP at all...

Here is how to pass data properly:

from kfp.components import InputPath, OutputPath, create_component_from_func

def download_images(
    url: str,
    output_path: OutputPath(),
):
    ...
    # Create directory at output_path
    # Put all images into it

download_images_op = create_component_from_func(download_images)

def compress_images(
    input_path: InputPath(),
    output_path: OutputPath(),
):
    ...
    # Read images from input_path
    # Create directory at output_path
    # Write results to output_path

compress_images_op = create_component_from_func(compress_images)

def my_pipeline():
    images = download_images_op(
        url=...,
    ).outputs["output"]

    compressed_images = compress_images_op(
        input=images,
    ).outputs["output"]
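
The placeholder bodies above could be filled in roughly as follows. This is only a minimal sketch under some assumptions: the parameters are typed as plain `str` so that it runs without KFP installed (in a real component they would carry the `InputPath()` and `OutputPath()` annotations shown above), `write_images` is a hypothetical stand-in for the download step so that no network access is needed, and gzip is used purely as a placeholder for whatever image compression is actually wanted. The point it illustrates is that the path is treated as a directory, so it can hold any number of files.

```python
import gzip
import os
import shutil


def write_images(images: dict, output_path: str) -> None:
    """Hypothetical stand-in for the download step: one file per image."""
    # The component receives only a path and creates the directory itself.
    os.makedirs(output_path, exist_ok=True)
    for name, data in images.items():
        with open(os.path.join(output_path, name), "wb") as f:
            f.write(data)


def compress_images(input_path: str, output_path: str) -> None:
    """Reads every file in input_path and writes a gzip copy to output_path."""
    os.makedirs(output_path, exist_ok=True)
    for name in sorted(os.listdir(input_path)):
        src = os.path.join(input_path, name)
        dst = os.path.join(output_path, name + ".gz")
        # Stream the file so memory use does not grow with file size.
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
```

Because the files are streamed from one directory to another, the number of images does not change how the two functions are wired together.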

You can also find many examples of real-world components in this repo: https://github.com/Ark-kun/pipeline_components/tree/master/components

P.S. As a small team, we've spent so much time answering user questions about volumes not working, despite the official documentation and all samples and tutorials showing how to use the proper methods and never suggesting volumes. I want to understand where this comes from. Is there some unofficial KFP tutorial on the Internet that teaches users to pass data via volumes?

Ark-kun
  • Hi @ark-kun, thanks for the support. I read a lot of Kubeflow documentation about pvolumes and how to use them, and I chose that because I still don't understand how I can set output_path = '/files/download'; also, in another question here, another user told me about that. Do you have any way to receive private messages so that I could ask you there, only if you are willing, of course? – Tlaloc-ES Aug 06 '22 at 14:05
  • >"how can I set output_path = '/files/download'" - Is this a business need? Is this a part of a client contract which says "The developer must set output_path (which??) to /files/download (on which filesystem??)"? Probably not. Your business needs are probably different. It would be easier if you stated what you want to accomplish. I think you just need to pass data between components. You do not need "/files/download", and I'm not sure such a path makes sense since it's inside an ephemeral container. If you have a weird program with a hardcoded path, then you can always add a wrapper to copy files. – Ark-kun Aug 06 '22 at 21:19
  • The question is: if I download 1000 files in the first op, the second op needs to train on them, and the next op needs to validate, how can I pass the 1000 files from op 1 to op 2 and op 3? The example that I saw read a single file as bytes. So the first question is how I can pass a lot of data from step A to step N, and the second one is where that data is stored: is it in RAM? Will I need 1 GB of extra RAM for each 1000 files? – Tlaloc-ES Aug 06 '22 at 21:48
  • Which programming language are you using? Say, for downloading. Is it Python? Shell script? Something else? – Ark-kun Aug 07 '22 at 00:13
  • The first component `download_images` creates a directory at the path received as the `output_path` parameter value and then downloads all images into that directory. – Ark-kun Aug 07 '22 at 00:14
  • The second component `compress_images` creates a directory at the path received as the `output_path` parameter value, then reads all images from the `input_path` directory, then writes compressed images into the `output_path` directory. – Ark-kun Aug 07 '22 at 00:16
  • The pipeline connects the output of the `download_images_op` component to the input of the `compress_images_op` component. You do not need any RAM for this. – Ark-kun Aug 07 '22 at 00:18
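
The wrapper mentioned in the comments, for a program that insists on writing to a hardcoded location such as `/files/download`, might look something like the sketch below. The function name `copy_hardcoded_output` and its parameters are hypothetical and not part of KFP; the idea is simply to let the program write wherever it wants and then copy the result to the path the component was given. It assumes Python 3.8 or later for `dirs_exist_ok`.

```python
import os
import shutil


def copy_hardcoded_output(hardcoded_dir: str, output_path: str) -> int:
    """Copies a program's fixed output directory into output_path.

    Returns the number of files now present under output_path.
    """
    # Create the destination first; the component only receives a path.
    os.makedirs(output_path, exist_ok=True)
    # Copy the whole tree, including subdirectories (Python 3.8+).
    shutil.copytree(hardcoded_dir, output_path, dirs_exist_ok=True)
    return sum(len(files) for _, _, files in os.walk(output_path))
```

Inside a component, this would be called after the program has finished, with `output_path` being the value received for the `OutputPath()` parameter.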