I need to build a pipeline that creates a dataset for training a model. I am using two data sources, and I would like to be able to choose whether to use one source, the other, or both. To do this, I have created two pipeline parameters, use_first and use_second.
A simplified version of my pipeline looks like this:
@dsl.pipeline(
    name="data processing pipeline"
)
def data_processing(
    use_first: bool,
    use_second: bool
):
    load_first = load_first_comp(
        use_first=use_first
    )
    load_second = load_second_comp(
        use_second=use_second
    )
    corpora_combining = corpora_combining_comp(
        first_dir=load_first.output,
        second_dir=load_second.output
    )
When use_first or use_second is True, the corresponding dataset is processed. Otherwise, this code runs inside the component:
if not use_first:
    return
The first_dir and second_dir parameters of the corpora_combining component can be False, in which case the folders for those datasets are skipped.
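To make the component-side behavior described above concrete, here is a minimal plain-Python sketch without any Kubeflow decorators. The function names, the source_dir argument and the "*.txt" file pattern are assumptions for illustration only, not my actual components.

```python
from pathlib import Path
from typing import List, Union


def load_first(use_first: bool, source_dir: str) -> Union[str, bool]:
    """Hypothetical body of load_first_comp: skip the work when the flag is off."""
    if not use_first:
        return False  # the downstream step treats a falsy value as "no dataset"
    # Real processing would happen here; the sketch just passes the folder on.
    return source_dir


def combine_corpora(
    first_dir: Union[str, bool], second_dir: Union[str, bool]
) -> List[str]:
    """Hypothetical body of corpora_combining_comp: ignore disabled sources."""
    files: List[str] = []
    for directory in (first_dir, second_dir):
        if not directory:  # False (or None) means the source was skipped
            continue
        # Assumed layout: each dataset folder holds plain .txt files.
        files.extend(sorted(str(p) for p in Path(directory).glob("*.txt")))
    return files
```

With use_first=True and use_second=False, only the files from the first folder end up in the combined list.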
I know that Kubeflow lets you use dsl.Condition to create conditions in a pipeline, but I don't know how to use it to get the behavior I want. I tried it in two ways, with the parameter values set like this:
use_first = True
use_second = False
At first, I did it like this:
@dsl.pipeline(
    name="data processing pipeline"
)
def data_processing(
    use_first: bool,
    use_second: bool
):
    with dsl.Condition(use_first == True):
        load_first = load_first_comp()
        first_dir = load_first.output
    with dsl.Condition(use_first == False):
        first_dir = False
    with dsl.Condition(use_second == True):
        load_second = load_second_comp()
        second_dir = load_second.output
    with dsl.Condition(use_second == False):
        second_dir = False
    corpora_combining = corpora_combining_comp(
        first_dir=first_dir,
        second_dir=second_dir
    )
This gave me the following error:
Pipeline must have at least one task.
The second way was like this:
@dsl.pipeline(
    name="data processing pipeline"
)
def data_processing(
    use_first: bool,
    use_second: bool
):
    first_dir = False
    second_dir = False
    with dsl.Condition(use_first == True):
        load_first = load_first_comp()
        first_dir = load_first.output
    with dsl.Condition(use_second == True):
        load_second = load_second_comp()
        second_dir = load_second.output
    corpora_combining = corpora_combining_comp(
        first_dir=first_dir,
        second_dir=second_dir
    )
With this approach the pipeline started, but the corpora_combining step failed because the output of the load_second step could not be obtained. I expected this line to be skipped:
second_dir = load_second.output
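As far as I can tell, the reason may be that the pipeline function is ordinary Python that runs once when the pipeline is compiled, and a with block never skips its body, so the assignment always executes. A minimal sketch with a stand-in context manager (fake_condition is hypothetical and is not the Kubeflow API):

```python
from contextlib import contextmanager


@contextmanager
def fake_condition(flag: bool):
    """Hypothetical stand-in for dsl.Condition: it only receives the flag."""
    yield flag


def build(use_second: bool):
    second_dir = False
    with fake_condition(use_second):
        # The body of a with block runs regardless of the flag's value,
        # so this assignment overwrites the default every time.
        second_dir = "output-of-load_second"
    return second_dir


print(build(False))  # prints "output-of-load_second", not False
```

If that is right, second_dir always points at the output of load_second by the time corpora_combining_comp is called, even when the condition is false at run time.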
How can an if/else statement be simulated in this case using dsl.Condition?