
I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline

Anyways, I’ve got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all of those .json files and train on all of them. I could possibly use just the new .json files as well (an interesting option, honestly).

Currently I just have the store mounted and available; the code just iterates the mount for the data files and loads them up.

  1. How can I use data references for this instead?
  2. What do data references do for me that mounting time-stamped data does not?
     a. From an audit perspective, I have version control, execution time, and time-stamped read-only data. Admittedly, doing a replay on this would require additional coding, but it is doable.
David Crook

2 Answers

3

As mentioned, the input to the step can be a DataReference to the blob folder.

You can use the default datastore or add your own datastore to the workspace, then add it as an input to the step. When you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldn't dynamically add steps for each file; I would just read all the files from your storage in a single step.

from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import PythonScriptStep

# Point at the folder on the workspace's default datastore that holds the .json files.
ds = ws.get_default_datastore()
blob_input_data = DataReference(
    datastore=ds,
    data_reference_name="data1",
    path_on_datastore="folder1/")

# Pass the DataReference both as a step input and as a command-line argument,
# so train.py receives the mounted path at runtime.
step1 = PythonScriptStep(name="1step",
                         script_name="train.py",
                         compute_target=compute,
                         source_directory='./folder1/',
                         arguments=['--data-folder', blob_input_data],
                         runconfig=run_config,
                         inputs=[blob_input_data],
                         allow_reuse=False)

Then inside your train.py you access the path as

import argparse

# The --data-folder argument is populated with the mounted DataReference path.
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
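
Since Stream Analytics will have written many .json files under that folder, a minimal sketch of reading them all inside that single training step could look like the following; the recursive glob and the one-document-per-file assumption are mine, not from the answer.

import glob
import json
import os

# args.data_folder is the mounted DataReference path from the snippet above.
json_files = glob.glob(os.path.join(args.data_folder, '**', '*.json'), recursive=True)
print('Found', len(json_files), 'telemetry files')

records = []
for path in json_files:
    with open(path) as f:
        # Assumption: one JSON document per file; if Stream Analytics wrote
        # line-delimited JSON, parse each line with json.loads instead.
        records.append(json.load(f))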

Regarding benefits, it depends on how you are mounting. For example, if you are dynamically mounting in code, the credentials to mount need to be in your code, whereas a DataReference allows you to register the credentials once, and we can use Key Vault to fetch them at runtime. Or, if you are statically making the mount on a machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials on any AmlCompute and will tear the mount down right after the job is over.
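
For context, registering your own blob container as a datastore is a one-time step; a rough sketch follows, where the datastore, container, account names and key are placeholders.

from azureml.core import Datastore, Workspace

ws = Workspace.from_config()

# Register the container once; the account key is kept in the workspace's
# Key Vault and resolved at runtime, so it never appears in training code.
telemetry_ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name='telemetry_store',     # placeholder name
    container_name='telemetry',           # placeholder container
    account_name='mystorageaccount',      # placeholder storage account
    account_key='<storage-account-key>')  # placeholder secret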

Finally, if you want to train on a regular interval, it's pretty easy to schedule the pipeline to run regularly. For example:

from azureml.pipeline.core import Schedule, ScheduleRecurrence

# Publish the pipeline from a completed run so a schedule can trigger it.
pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1",
                                              description="Some desc",
                                              version="1",
                                              continue_on_step_failure=True)

# Re-run the published pipeline every hour.
recurrence = ScheduleRecurrence(frequency="Hour", interval=1)

schedule = Schedule.create(workspace=ws, name="Schedule for sample",
                           pipeline_id=pub_pipeline.id,
                           experiment_name='Schedule_Run_8',
                           recurrence=recurrence,
                           wait_for_provisioning=True,
                           description="Scheduled Run")
1

You could pass a pointer to the folder as an input parameter for the pipeline, and then your step can mount the folder and iterate over the .json files.
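
If I read this answer correctly, one way to express that folder pointer is a DataPath pipeline parameter bound to the compute as a mount; a rough sketch, reusing the datastore, compute, and run config names from the first answer.

from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Default folder to mount; can be overridden when the pipeline is submitted.
data_path = DataPath(datastore=ds, path_on_datastore='folder1/')
datapath_param = PipelineParameter(name="input_datapath", default_value=data_path)
datapath_input = (datapath_param, DataPathComputeBinding(mode='mount'))

step = PythonScriptStep(name="train_step",
                        script_name="train.py",
                        compute_target=compute,
                        source_directory='./folder1/',
                        arguments=['--data-folder', datapath_input],
                        inputs=[datapath_input],
                        runconfig=run_config)
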

  • You mean, in my pipeline creation, dynamically iterate over all .json files and add one step to the pipeline per file as a reference? I can see this as possible; does that impact performance? I'm up to 341 files so far and have only been collecting for a few weeks. – David Crook Sep 03 '19 at 17:47
  • Yes, that's what I meant. I don't think there will be a perf hit. If there is, we can take a look. – Santhosh Pillai Sep 04 '19 at 18:03