
I'm building an ML pipeline in Kubeflow and I have a question. Is there anything out of the box that allows me to configure my pipeline such that a step is not rerun if its output already exists? I've thought of ways to do this manually (checking for existing outputs as I'm compiling the pipeline, having an initial step that returns a list of steps to run, or manually configuring which steps to run via an input parameter), but I cannot find a native way of handling this.
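To illustrate the first manual workaround (checking for existing outputs at compile time), here is a minimal sketch in plain Python. The step names and the `steps_to_run` helper are hypothetical, not part of the Kubeflow API; the idea is just to compute, before compiling, which steps need to be included:

```python
# Hypothetical compile-time helper: given which step outputs already
# exist, return the steps that still need to run. Since each step
# depends on the previous one, everything from the first missing
# output onward must be (re)run.

STEP_ORDER = ["prep", "train", "evaluate"]  # assumed pipeline order

def steps_to_run(output_exists):
    """Return the list of steps to include in the compiled pipeline.

    `output_exists` maps step name -> bool (True if its output is
    already present, e.g. checked against object storage).
    """
    for i, step in enumerate(STEP_ORDER):
        if not output_exists.get(step, False):
            return STEP_ORDER[i:]
    return []  # all outputs present: nothing to rerun
```

For example, `steps_to_run({"prep": True, "train": False})` returns `["train", "evaluate"]`, so only those steps would be added when building the pipeline.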

The common use case for me would be to rerun the model step without rerunning any pre-processing of the data, but without needing a separate "model development" pipeline that differs from the more general prod one that includes the pre-processing step. Or perhaps I'm iterating on an evaluation phase and don't even need retraining, but would still like to use the same pipeline. Right now, colleagues are working around this by maintaining several pipelines, each starting at a different step.

I'm coming at it from a map-reduce perspective, where this is trivial: the framework automatically detects which outputs are present and doesn't rebuild them by default, but easily gives you the option to rebuild some or all of them. Maybe this is biasing my way of working with Kubeflow?

Any help appreciated!

1 Answer


OK, I thought I'd post here what I've found to solve this.

As of September 2019, this is not a feature of Kubeflow (according to people working on it), but there is a caching feature in the works that should not rerun any steps whose outputs exist.

In the meantime I implemented it manually, via a pipeline parameter 'startingStep' indicating the step from which everything needs to be rerun. Something like this:

# Inside the @dsl.pipeline function; first_step_to_run is a pipeline parameter
with dsl.Condition(first_step_to_run == "prep"):
    create_ops(StartingStep.prep)
with dsl.Condition(first_step_to_run == "train"):
    create_ops(StartingStep.train)
with dsl.Condition(first_step_to_run == "evaluate"):
    create_ops(StartingStep.evaluate)

with a create_ops method that understands what order to create steps in and chains them appropriately (we actually have seven steps so I really wanted to avoid copy/pasting all over).
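The answer doesn't show `create_ops` itself, but a minimal sketch of the idea might look like this. The `StartingStep` enum matches the names above; the op factories are placeholders standing in for the real component/ContainerOp constructors, and the `.after()` chaining is shown as a comment since it only applies inside an actual Kubeflow pipeline:

```python
from enum import IntEnum

class StartingStep(IntEnum):
    prep = 0
    train = 1
    evaluate = 2

# Placeholder op factories; in the real pipeline these would build
# dsl.ContainerOp (or component) instances for each step.
OP_FACTORIES = [
    ("prep", lambda: "prep_op"),
    ("train", lambda: "train_op"),
    ("evaluate", lambda: "evaluate_op"),
]

def create_ops(starting_step):
    """Create ops from `starting_step` onward, in order, chaining each
    op to the previous one so they run sequentially."""
    ops = []
    for name, factory in OP_FACTORIES[starting_step:]:
        op = factory()
        # In Kubeflow this is where you'd enforce ordering:
        # if ops: op.after(ops[-1])
        ops.append(op)
    return ops
```

With this layout, adding a fourth step only means extending the enum and the factory list, rather than copy/pasting condition blocks.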

  • Glad you found a solution. I have a quick question: how are you making sure one step finishes before the next begins? i.e. if you run from `prep`, how do you make sure its output is uploaded, before train begins? – Jonas D Apr 03 '20 at 14:07