AzureML Python SDK OutputFileDatasetConfig and Datastore

Question

I am new to Azure, and dealing with all these paths is proving extremely challenging. I am trying to create a pipeline that contains a dataprep.py step and an AutoML step. After passing the input to the dataprep step and performing several operations on it, I want to save the resulting TabularDataset in the datastore and also expose it as the step's output, so that I can reuse it in my training step.

My dataprep.py file

   # ----- dataprep helper functions and other imports go here -----
   import argparse
   from azureml.core import Dataset, Run
        
   parser = argparse.ArgumentParser()
   parser.add_argument("--month_train", required=True)
   parser.add_argument("--year_train", required=True)
   parser.add_argument('--output_path', dest = 'output_path', required=True)
        
   args = parser.parse_args()
        
   run = Run.get_context()
   ws = run.experiment.workspace
   datastore = ws.get_default_datastore()
        
   name_dataset_input = 'Customer_data_'+str(args.year_train)
   name_dataset_output = 'DATA_PREP_'+str(args.year_train)+'_'+str(args.month_train)
        
        
   # get the input dataset by name
   ds = Dataset.get_by_name(ws, name_dataset_input)
   df = ds.to_pandas_dataframe()

   # apply is one of my dataprep functions, defined earlier
   df = apply(df, args.month_train)

   # this is where I am having issues: I want to save this in the datastore but also have it as the step output
   ds = Dataset.Tabular.register_pandas_dataframe(df, args.output_path, name_dataset_output)
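
For reference, here is a minimal, self-contained sketch of just the argument handling and dataset naming from the script above, with all AzureML calls left out. The helper names `parse_args` and `dataset_names` and the example argument values are hypothetical, introduced only for illustration; the argument names and the naming scheme are taken from the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the three arguments that dataprep.py expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--month_train", required=True)
    parser.add_argument("--year_train", required=True)
    parser.add_argument("--output_path", dest="output_path", required=True)
    # argv=None makes argparse fall back to sys.argv[1:], as in the script.
    return parser.parse_args(argv)


def dataset_names(args):
    """Build the input and output dataset names from the parsed arguments."""
    name_dataset_input = "Customer_data_" + str(args.year_train)
    name_dataset_output = (
        "DATA_PREP_" + str(args.year_train) + "_" + str(args.month_train)
    )
    return name_dataset_input, name_dataset_output


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = parse_args(
        ["--output_path", "out", "--month_train", "03", "--year_train", "2021"]
    )
    print(dataset_names(example))
```

The parsed namespace exposes exactly `month_train`, `year_train` and `output_path`, so those are the only attribute names the rest of the script can read from `args`.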

The pipeline step definition:

    from azureml.data import OutputFileDatasetConfig
    from azureml.pipeline.steps import PythonScriptStep

    # datastore, compute_target, aml_run_config, month_train and year_train
    # are assumed to be defined earlier in the pipeline script
    prepped_data_path = OutputFileDatasetConfig(
        name="output_path",
        destination=(datastore, "managed-dataset/{run-id}/{output-name}"),
    )

    dataprep_step = PythonScriptStep(
        name="dataprep",
        script_name="dataprep.py",
        compute_target=compute_target,
        runconfig=aml_run_config,
        arguments=["--output_path", prepped_data_path, "--month_train", month_train, "--year_train", year_train],
        allow_reuse=True,
    )
