I am new to Azure, and dealing with all these paths is proving extremely challenging. I am trying to create a pipeline that contains a dataprep.py step and an AutoML step. After passing the input to the dataprep block and performing several operations on it, I want to save the resulting TabularDataset in the datastore and also expose it as an output, so that I can reuse it in my train block.
My dataprep.py file:
# ----- dataprep stuff and imports -----
parser = argparse.ArgumentParser()
parser.add_argument("--month_train", required=True)
parser.add_argument("--year_train", required=True)
parser.add_argument("--output_path", dest="output_path", required=True)
args = parser.parse_args()
run = Run.get_context()
ws = run.experiment.workspace
datastore = ws.get_default_datastore()
name_dataset_input = 'Customer_data_'+str(args.year_train)
name_dataset_output = 'DATA_PREP_'+str(args.year_train)+'_'+str(args.month_train)
# get the input dataset by name
ds = Dataset.get_by_name(ws, name_dataset_input)
df = ds.to_pandas_dataframe()
# apply is one of my dataprep functions that I defined earlier
df = apply(df, args.month_train)
# this is where I am having issues: I want to save this in the datastore but also have it as an output
ds = Dataset.Tabular.register_pandas_dataframe(df, args.output_path ,name_dataset_output)
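For comparison, here is a minimal sketch of the script side. It assumes that a value passed as --output_path from an OutputFileDatasetConfig arrives in the script as a plain folder path, not as a datastore target. The helper name write_prepped_data and the file name prepped.csv are hypothetical and used only for illustration.

```python
import os


def write_prepped_data(df, output_path, file_name="prepped.csv"):
    """Write df as a CSV file inside the folder given by --output_path.

    Assumption: output_path is a directory path handed to the script by the
    pipeline (not a datastore object), and it may not exist yet.
    """
    # Create the output folder if the pipeline has not created it already.
    os.makedirs(output_path, exist_ok=True)
    target = os.path.join(output_path, file_name)
    # index=False keeps the pandas row index out of the written file.
    df.to_csv(target, index=False)
    return target
```

In dataprep.py this would be called as write_prepped_data(df, args.output_path). On the pipeline side, the azureml SDK documents read_delimited_files() and register_on_complete(name=...) on OutputFileDatasetConfig for turning such files into a registered tabular dataset. Whether that combination fits this pipeline has not been tested here.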
The pipeline block definition:
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep
prepped_data_path = OutputFileDatasetConfig(name="output_path", destination=(datastore, 'managed-dataset/{run-id}/{output-name}'))
dataprep_step = PythonScriptStep(
name="dataprep",
script_name="dataprep.py",
compute_target=compute_target,
runconfig=aml_run_config,
arguments=["--output_path", prepped_data_path, "--month_train", month_train, "--year_train", year_train],
allow_reuse=True