How can you specify the local path of InputPath or OutputPath in Kubeflow Pipelines

Question

I've started using Kubeflow Pipelines to run data processing, training and predicting for a machine learning project, and I'm using InputPath and OutputPath to pass large files between components.

I'd like to know whether it's possible to set the path where OutputPath looks for a file in a component, and where InputPath places a file in a component, and if so, how.

Currently, the code stores them in a pre-determined place (e.g. data/my_data.csv), and it would be ideal if I could 'tell' InputPath/OutputPath that this is the file it should copy, instead of having to rename every file to match what OutputPath expects, as in the minimal example below.

from kfp import dsl
from kfp.components import OutputPath, create_component_from_func

def pre_process_data(pre_processed_path: OutputPath('csv')):
    import os

    print('do some processing which saves file to data/pre_processed.csv')

    # want to avoid this:
    print('move files to OutputPath locations...')
    os.rename('data/pre_processed.csv', pre_processed_path)

@dsl.pipeline(name='test_pipeline')
def pipeline():
    pp = create_component_from_func(func=pre_process_data)()
    # use pp.outputs['pre_processed']...

Naturally I would prefer not to update the code to adhere to Kubeflow pipeline naming convention, as that seems like very bad practice to me.

Thanks!

joe.liedtke · Accepted Answer · 2020-08-19T21:00:27.670

3

Update: see Ark-kun's comment. The approach in my original answer is deprecated and should not be used. It is better to let Kubeflow Pipelines specify where you should store your pipeline's artifacts.

For lightweight components (such as the one in your example), Kubeflow Pipelines builds the container image for your component and specifies the paths for inputs and outputs (based upon the types you use to decorate your component function). I would recommend using those paths directly, instead of writing to one location and then renaming the file. The Kubeflow Pipelines samples follow this pattern.
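
A minimal sketch of that pattern, assuming the processing code can accept its destination as an argument: the function below writes straight to whatever path it is given, so no rename is needed. The sample rows are made up for illustration, and the plain `str` annotation stands in for `OutputPath('csv')` so the snippet runs without the KFP SDK installed.

```python
import csv
import os


def pre_process_data(pre_processed_path: str):
    # In a lightweight component this parameter would be annotated with
    # OutputPath('csv'); plain str is used here so the sketch runs standalone.
    parent = os.path.dirname(pre_processed_path)
    if parent:
        # Harmless if the directory already exists.
        os.makedirs(parent, exist_ok=True)

    # Hypothetical stand-in for the real processing result.
    rows = [("id", "value"), (1, 0.5), (2, 0.75)]

    # Write directly to the provided path instead of data/pre_processed.csv.
    with open(pre_processed_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```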

For reusable components, you define the pipeline inputs and outputs as part of the YAML specification for the component. In that case you can specify your preferred location for the output files. That being said, reusable components take a bit more effort to create, since you need to build a Docker container image and component specification in YAML.

edited Aug 19 '20 at 21:00

answered Apr 03 '20 at 21:41

joe.liedtke


Thanks, I guess what I'm looking for doesn't exist with lightweight components then. Interesting that it is available for reusable components though, good to know. Guess I'll have to edit the underlying code for now then, and make the filename optionally settable. – Jonas D Apr 06 '20 at 16:07

Note that the `implementation.container.fileOutputs` feature is a deprecated-from-the-start legacy feature that we failed to remove before the first release. It's not supposed to be used in any new components and will be fully deprecated and removed in the future. We advise our users to use the generated paths provided by InputPath/OutputPath and not hardcode any paths in the component code. – Ark-kun Jul 11 '20 at 06:02

score 1 · Answer 2 · answered Jul 31 '22 at 07:46

This is not supported by the system. Components should use the system-provided paths. This is important, because on some execution engines the data is mounted to those paths. And sometimes these paths have certain restrictions or might even be unchangeable. So the system must have the freedom to choose the paths.

Usually, good programs do not hard-code any absolute paths inside their code, but rather receive the paths from the command line.
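
One way to read this advice, sketched under the assumption that the existing processing script can take a command-line flag: the hypothetical `--output-path` option below defaults to the old hard-coded location, so the script still works outside a pipeline, while a component can pass in the system-provided path. The `/tmp/outputs/...` value is only an illustrative placeholder.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Pre-process data")
    # Hypothetical flag; the default keeps the original hard-coded location.
    parser.add_argument("--output-path", default="data/pre_processed.csv")
    return parser.parse_args(argv)


# Standalone run: falls back to the original location.
local_args = parse_args([])
# Inside a component: the caller supplies the path chosen by the system.
pipeline_args = parse_args(["--output-path", "/tmp/outputs/pre_processed/data"])
```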

In any case, it's pretty easy to copy the files from or to the system-provided paths (as you already do in the code).
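
The copy step can be sketched with the standard library. The helper name `copy_to` is an illustrative assumption. The same function works in both directions: from the legacy location to the output path, or from the input path to wherever the existing code expects its file.

```python
import os
import shutil


def copy_to(src: str, dst: str) -> str:
    """Copy src to dst, creating dst's parent directory if needed."""
    parent = os.path.dirname(dst)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # shutil.copyfile copies file contents only and returns the destination.
    return shutil.copyfile(src, dst)
```

Inside the component from the question, `copy_to('data/pre_processed.csv', pre_processed_path)` could take the place of the `os.rename` call. Unlike a rename, a copy also works when the two paths sit on different filesystems.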
