
I'm new to azure-ml, and have been tasked to make some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a simplified version of what I want to do:

from azureml.core import Workspace, Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

main_ref = DataReference(datastore=ds,
                         data_reference_name='main_ref'
                         )

data_ref = DataReference(datastore=ds,
                         data_reference_name='data_ref',
                         path_on_datastore='/data'
                         )


data_prep_step = PythonScriptStep(
            name='data_prep',
            script_name='pipeline_steps/data_prep.py',
            source_directory='/.',
            arguments=['--main_path', main_ref,
                        '--data_ref_folder', data_ref
                        ],
            inputs=[main_ref, data_ref],
            outputs=[data_ref],
            runconfig=arbitrary_run_config,
            allow_reuse=False
            )

I would like:

  • my data_prep_step to run,
  • have it store some data on the path of my data_ref, and
  • I would then like to access this stored data afterwards outside of the pipeline

But I can't find a function for this in the documentation. Any guidance would be much appreciated.

Average_guy
  • from where would you like to access this data? from a downstream `PythonScriptStep`? or outside of the ML pipeline entirely? – Anders Swanson Mar 24 '21 at 19:28
  • I want to access it (main_ref/data) after having test-run the data_prep_step. Maybe there is more I'm misunderstanding. If anybody knows a good source for "testing pipeline steps individually", please let me know. – Average_guy Mar 24 '21 at 22:07
  • you're singing the song of my people! there are not enough people talking about data pipeline testing, IMHO. do you want: 1) unit testing (the code in the step works)? 2) integration testing (the code works when submitted to the Azure ML service)? 3) data expectation testing (the data coming out of the step meets my expectations)? – Anders Swanson Mar 24 '21 at 22:10
  • I did get a little discouraged trying to find a good source, but perhaps this thread can become a help to future people. I love the three categories you created - I want 3) the data coming out of the step meets my expectations :) – Average_guy Mar 24 '21 at 22:19
  • I think the core problem for me is that PipelineData is being used in the main code, and the documentation page of the class states: "...produced by one step and consumed in another step". But I only want to run one step and then check the data (I don't want to pass it to another step, I just want it for comparison reasons!). – Average_guy Mar 24 '21 at 22:26
  • cool. writing up an answer now. TL;DR: check out `OutputFileDatasetConfig` https://learn.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#use-outputfiledatasetconfig-for-intermediate-data – Anders Swanson Mar 24 '21 at 22:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/230342/discussion-between-anders-swanson-and-average-guy). – Anders Swanson Mar 24 '21 at 22:56

1 Answer


two big ideas here -- let's start with the main one.

main ask

With an Azure ML Pipeline, how can I access the output data of a PythonScriptStep outside of the context of the pipeline?

short answer

Consider using OutputFileDatasetConfig (docs example), instead of DataReference.

To your example above, I would just change your last two definitions.

from azureml.data import OutputFileDatasetConfig

data_ref = OutputFileDatasetConfig(
    name='data_ref',
    destination=(ds, '/data')
).as_upload()


data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=[
        '--main_path', main_ref,
        '--data_ref_folder', data_ref
    ],
    inputs=[main_ref],   # data_ref is now an output, so it no longer belongs in inputs
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)
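
Inside pipeline_steps/data_prep.py, the --data_ref_folder argument resolves to a writable local folder on the compute target, and whatever the script puts there gets uploaded to (test_datastore, '/data') when the step finishes. A minimal sketch of what that script side might look like (the DataFrame and CSV file name are hypothetical placeholders, not from the original post):

import argparse
import os
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument('--main_path', type=str)
parser.add_argument('--data_ref_folder', type=str)
args = parser.parse_args()

# the output folder may not exist yet on the compute target
os.makedirs(args.data_ref_folder, exist_ok=True)

# illustrative output only -- write whatever your prep step actually produces
df = pd.DataFrame({'a': [1, 2, 3]})
df.to_csv(os.path.join(args.data_ref_folder, 'prepped.csv'), index=False)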

some notes:

  • be sure to check out how DataPaths work. They can be tricky at first glance.
  • set `overwrite=False` in the `.as_upload()` method if you don't want future runs to overwrite the first run's data.
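
Once the pipeline run has finished, one way to get at the uploaded data outside of the pipeline is to point a FileDataset at the same datastore path and download it for comparison against your expected output. A minimal sketch, assuming the (ds, '/data') destination above and the workspace config from your question:

from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config('blabla/config.json')
ds = Datastore.get(ws, datastore_name='test_datastore')

# point a FileDataset at the folder the step uploaded to ...
actual = Dataset.File.from_files(path=(ds, 'data'))

# ... and pull it down locally so you can diff it against your expected output
local_paths = actual.download(target_path='./actual_output', overwrite=True)
print(local_paths)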

more context

PipelineData used to be the de facto object for passing data ephemerally between pipeline steps. The idea was to make it easy to:

  1. stitch steps together
  2. get the data after the pipeline runs if need be (datastore/azureml/{run_id}/data_ref)

The downside was that you had no control over where the PipelineData was saved. If you wanted the data to be more than just a baton that gets passed between steps, you had to add a DataTransferStep to land the PipelineData wherever you please after the PythonScriptStep finishes.

This downside is what motivated OutputFileDatasetConfig.
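
For contrast, here's a minimal sketch of that older PipelineData pattern (the names mirror the example above purely for illustration):

from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# lands under datastore/azureml/{run_id}/data_ref, with no say in the path
data_ref = PipelineData('data_ref', datastore=ds)

data_prep_step = PythonScriptStep(
    name='data_prep',
    script_name='pipeline_steps/data_prep.py',
    source_directory='/.',
    arguments=['--data_ref_folder', data_ref],
    outputs=[data_ref],
    runconfig=arbitrary_run_config,
    allow_reuse=False
)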

auxiliary ask

how might I programmatically test the functionality of my Azure ML pipeline?

there are not enough people talking about data pipeline testing, IMHO.

There are three areas of data pipeline testing:

  1. unit testing (does the code in the step work?)
  2. integration testing (does the code work when submitted to the Azure ML service?)
  3. data expectation testing (does the data coming out of the step meet my expectations?)

For #1, I think it should be done outside of the pipeline, perhaps as part of a package of helper functions. For #2, why not just see if the whole pipeline completes? I think you get more information that way. That's how we run our CI.
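
For #2, the CI check can be as simple as submitting the pipeline and asserting that the run completes. A minimal sketch, assuming the data_prep_step defined above and an experiment name I made up:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[data_prep_step])
run = Experiment(ws, 'integration-test').submit(pipeline)
run.wait_for_completion(show_output=True)

# fail the CI job if the pipeline did not finish successfully
assert run.get_status() == 'Completed'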

#3 is the juiciest, and we do this in our pipelines with the Great Expectations (GE) Python library. The GE community calls these "expectation tests". To me you have two options for including expectation tests in your Azure ML pipeline:

  1. within the PythonScriptStep itself, i.e.
    1. run whatever code you have
    2. test the outputs with GE before writing them out; or,
  2. for each functional PythonScriptStep, hang a downstream PythonScriptStep off of it in which you run your expectations against the output data.

Our team does #1, but either strategy should work. What's great about this approach is that you can run your expectation tests by just running your pipeline (which also makes integration testing easy).
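
To make option 1 concrete, here's a minimal sketch of expectation tests inside data_prep.py using the Great Expectations from_pandas-style API; the DataFrame df and the column 'age' are hypothetical stand-ins:

import great_expectations as ge

# wrap the step's output DataFrame so it exposes expect_* methods
df_ge = ge.from_pandas(df)

# a couple of illustrative expectations
results = [
    df_ge.expect_column_values_to_not_be_null('age'),
    df_ge.expect_column_values_to_be_between('age', min_value=0, max_value=120),
]

# stop the step (and therefore the pipeline run) if any expectation fails
assert all(r.success for r in results), 'expectation tests failed'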

Anders Swanson
  • I want to apologize for being so late with my response. That being said, thank you so much! Your answer gives a fantastic overview, and I was able to make a working test. After reading more of the documentation for 'OutputFileDatasetConfig', I decided to use 'ScriptRunConfig' instead of 'PythonScriptStep'. – Average_guy Mar 30 '21 at 10:47