I've started using Kubeflow Pipelines to run data processing, training and predicting for a machine learning project, and I'm using InputPath and OutputhPath to pass large files between components.
I'd like to know how, if it's possible, do I set the path that OutputPath would look for a file in in a component, and where InputPath would load a file in a component.
Currently, the code stores them in a pre-determined place (e.g. data/my_data.csv
), and it would be ideal if I could 'tell' InputPath/OutputPath this is the file it should copy, instead of having to rename all the files to match what OutputPath expects, as per below minimal example.
@dsl.pipelines(name='test_pipeline')
def pipeline():
pp = create_component_from_func(func=_pre_process_data)()
# use pp['pre_processed']...
def pre_process_data(pre_processed_path: OutputPath('csv')):
import os
print('do some processing which saves file to data/pre_processed.csv')
# want to avoid this:
print('move files to OutputPath locations...')
os.rename(f'data/pre_processed.csv', pre_processed_path)
Naturally I would prefer not to update the code to adhere to Kubeflow pipeline naming convention, as that seems like very bad practice to me.
Thanks!