As mentioned, the input to the step can be a DataReference to the blob folder.
You can use the default store or add your own store to the workspace.
Then add that as an input. Then when you get a handle to that folder in your train code, just iterate over the folder as you normally would. I wouldnt dynamically add steps for each file, I would just read all the files from your storage in a single step.
ds = ws.get_default_datastore()
blob_input_data = DataReference(
datastore=ds,
data_reference_name="data1",
path_on_datastore="folder1/")
step1 = PythonScriptStep(name="1step",
script_name="train.py",
compute_target=compute,
source_directory='./folder1/',
arguments=['--data-folder', blob_input_data],
runconfig=run_config,
inputs=[blob_input_data],
allow_reuse=False)
Then inside your train.py you access the path as
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder')
args = parser.parse_args()
print('Data folder is at:', args.data_folder)
Regarding benefits, it depends on how you are mounting. For example if you are dynamically mounting in code, then the credentials to mount need to be in your code, whereas a DataReference allows you to register credentials once, and we can use KeyVault to fetch them at runtime. Or, if you are statically making the mount on the machine, you are required to run on that machine all the time, whereas a DataReference can dynamically fetch the credentials from any AMLCompute, and will tear that mount down right after the job is over.
Finally, if you want to train on a regular interval, then its pretty easy to schedule it to run regularly. For example
pub_pipeline = pipeline_run1.publish_pipeline(name="Sample 1",description="Some desc", version="1", continue_on_step_failure=True)
recurrence = ScheduleRecurrence(frequency="Hour", interval=1)
schedule = Schedule.create(workspace=ws, name="Schedule for sample",
pipeline_id=pub_pipeline.id,
experiment_name='Schedule_Run_8',
recurrence=recurrence,
wait_for_provisioning=True,
description="Scheduled Run")