
I need to build a pipeline that creates a dataset for training a model. I am using two data sources, and I would like to be able to decide whether to use the first source, the second, or both. To do this, I created two pipeline parameters, for example use_first and use_second.

Simplified, my pipeline looks like this:

from kfp import dsl

@dsl.pipeline(
    name="data processing pipeline"
)
def data_processing(
    use_first: bool,
    use_second: bool
):
    load_first = load_first_comp(
        use_first=use_first
    )
    
    load_second = load_second_comp(
        use_second=use_second
    )

    corpora_combining = corpora_combining_comp(
        first_dir=load_first.output,
        second_dir=load_second.output
    )

When use_first or use_second is True, the corresponding dataset is processed; if not, the component returns early, i.e. this code runs inside it:

if not use_first:
    return

The first_dir and second_dir parameters of the corpora_combining component can also be False, in which case the folders for those datasets are skipped.
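To make this concrete, here is a simplified sketch of what the loader and combining components do (the real loading and combining logic is omitted, and the signatures are only approximations of my actual components):

from kfp.dsl import component

@component(base_image="python:3.9")
def load_first_comp(use_first: bool) -> str:
    # If this source is disabled, return early without producing data
    if not use_first:
        return ""
    # ... download and preprocess the first corpus here ...
    return "/data/first"


@component(base_image="python:3.9")
def corpora_combining_comp(first_dir: str, second_dir: str) -> str:
    # Skip any directory that is missing, empty, or was passed as False
    dirs = [d for d in (first_dir, second_dir) if d and d != "False"]
    # ... merge the datasets found under dirs ...
    return "/data/combined"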

I know that in Kubeflow you can use dsl.Condition to create conditions in a pipeline, but I don't know how to use it to achieve what I want. I tried it in two ways, setting the parameter values like this:

use_first = True
use_second = False

At first, I did it like this:

@dsl.pipeline(
    name="data processing pipeline"
)
def data_processing(
    use_first: bool,
    use_second: bool
):
    with dsl.Condition(use_first == True):
        load_first = load_first_comp()
        first_dir = load_first.output

    with dsl.Condition(use_first == False):
        first_dir = False

    with dsl.Condition(use_second == True):
        load_second = load_second_comp()
        second_dir = load_second.output

    with dsl.Condition(use_second == False):
        second_dir = False

    corpora_combining = corpora_combining_comp(
        first_dir=first_dir,
        second_dir=second_dir
    )

With this approach I got an error with the following message: Pipeline must have at least one task. I suspect this is because the use_first == False branch contains no component, only a plain Python assignment.

The second way was like this:

@dsl.pipeline(
    name="data processing pipeline"
)
def data_processing(
    use_first: bool,
    use_second: bool
):
    first_dir = False
    second_dir = False

    with dsl.Condition(use_first == True):
        load_first = load_first_comp()
        first_dir = load_first.output

    with dsl.Condition(use_second == True):
        load_second = load_second_comp()
        second_dir = load_second.output

    corpora_combining = corpora_combining_comp(
        first_dir=first_dir,
        second_dir=second_dir
    )

With this approach the pipeline started, but the corpora_combining step failed because the output of the load_second step could not be obtained. I expected this line to simply be skipped:

second_dir = load_second.output

How can an if/else statement be simulated in this case using dsl.Condition?

nietoperz21

1 Answer

I made a skip component for this.

from kfp.dsl import component  # or kfp.v2.dsl.component, depending on your KFP SDK version

@component(base_image="python:3.9")
def skip() -> str:
    import logging

    logger = logging.getLogger(__name__)
    logger.info("Skipping")
    return "skipped"

This way you could do:

    with dsl.Condition(use_first == True):
        load_first = load_first_comp()
        first_dir = load_first.output

    with dsl.Condition(use_first == False):
        first_dir = skip().output
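
Applied to both sources, the pipeline body would follow the same pattern. This is only a rough sketch (I have not tested this exact variant):

@dsl.pipeline(
    name="data processing pipeline"
)
def data_processing(
    use_first: bool,
    use_second: bool
):
    with dsl.Condition(use_first == True):
        load_first = load_first_comp()
        first_dir = load_first.output

    with dsl.Condition(use_first == False):
        first_dir = skip().output

    with dsl.Condition(use_second == True):
        load_second = load_second_comp()
        second_dir = load_second.output

    with dsl.Condition(use_second == False):
        second_dir = skip().output

    corpora_combining = corpora_combining_comp(
        first_dir=first_dir,
        second_dir=second_dir
    )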

I do wish there was a more elegant way to do this...

Shota Shimizu