Use a registered dataset in pipelines in AML

Question

I was following the SDK v2 Python tutorial in order to create a pipeline job with my own assets. I noticed that the tutorial uses a CSV file that can be downloaded, but I'm trying to use a dataset that I have already registered myself. The problem I'm facing is that I don't know where I need to specify the dataset.

The funny part is that at the beginning they create this dataset like this:

credit_data = ml_client.data.create_or_update(credit_data)
print(
    f"Dataset with name {credit_data.name} was registered to workspace, the dataset version is {credit_data.version}"
)

But the only place where they refer to this dataset is in the last part, where they comment out the line:

registered_model_name = "credit_defaults_model"

# Let's instantiate the pipeline with the parameters of our choice
pipeline = credit_defaults_pipeline(
    # pipeline_job_data_input=credit_data,
    pipeline_job_data_input=Input(type="uri_file", path=web_path),
    pipeline_job_test_train_ratio=0.2,
    pipeline_job_learning_rate=0.25,
    pipeline_job_registered_model_name=registered_model_name,
)

To me this means that I can use the data this way (as an already registered dataset). The problem is that I don't know where I need to make the changes (I know it's in data_prep.py and in the code below, but I don't know where else), and I don't know how to set this:

%%writefile {data_prep_src_dir}/data_prep.py
...

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data") # <=== Here, but I don't know how
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    args = parser.parse_args()

...

Does anyone have experience working with registered datasets?

score 0 · Accepted Answer · answered Jul 06 '22 at 04:42

parser.add_argument("--data", type=str, help="path to input data") # <=== Here, but I don't know how

To get the path to the input data, according to the documentation:

You can pass the dataset as the --input-data argument and retrieve it by ID inside your training script.
Alternatively, pass the mounted path as a script argument and read it as mounted_input_path.

For example, try the following three code snippets taken from the documentation (they use the SDK v1 azureml.core API):

Access dataset in training script:

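import argparse

from azureml.core import Dataset, Run

# the dataset ID is passed to the script as --input-data
parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str)
args = parser.parse_args()

# workspace of the current run
run = Run.get_context()
ws = run.experiment.workspace

# get the input dataset by ID
dataset = Dataset.get_by_id(ws, id=args.input_data)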

Configure the training run:

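from azureml.core import ScriptRunConfig

# script_folder, titanic_ds, compute_target and myenv are defined earlier in the documentation example
src = ScriptRunConfig(source_directory=script_folder,
                      script='train_titanic.py',
                      # pass dataset as an input with friendly name 'titanic'
                      arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                      compute_target=compute_target,
                      environment=myenv)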
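
The snippets here use the SDK v1 API. For the SDK v2 pipeline in the question, a rough equivalent is to point the pipeline input at the registered data asset instead of at web_path. This is a minimal sketch, assuming the asset was registered with type uri_file and that the short form azureml:<name>:<version> is accepted as the Input path. asset_reference and make_pipeline_input are illustrative helper names, not part of the SDK, and the name and version you pass in are whatever was printed when the asset was registered:

```python
def asset_reference(name: str, version: str) -> str:
    """Build the short-form reference to a registered data asset."""
    return f"azureml:{name}:{version}"


def make_pipeline_input(name: str, version: str):
    """Wrap a registered data asset as a pipeline job input (SDK v2)."""
    # Imported lazily so asset_reference() can be used without azure-ai-ml installed.
    from azure.ai.ml import Input
    from azure.ai.ml.constants import AssetTypes

    # Assumption: the asset was registered with type uri_file.
    return Input(type=AssetTypes.URI_FILE, path=asset_reference(name, version))
```

In the question's pipeline call this would take the place of the web_path line, for example pipeline_job_data_input=make_pipeline_input(credit_data.name, credit_data.version). Passing path=credit_data.id to Input should also work, but treat both as something to verify against your SDK version.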

Pass mounted_input_path as argument:

import sys

# the mounted input and output paths arrive as positional arguments
mounted_input_path = sys.argv[1]
mounted_output_path = sys.argv[2]

print("Argument 1: %s" % mounted_input_path)
print("Argument 2: %s" % mounted_output_path)
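
On the script side in SDK v2, a component input of type uri_file reaches the script as an ordinary path string, so the --data argument in data_prep.py should not need to change when the pipeline input is switched from a web URL to a registered asset. Here is a minimal sketch using only the standard library (a real script would more likely load the file with pandas). parse_args and count_rows are hypothetical helper names:

```python
import argparse
import csv


def parse_args(argv=None):
    """Parse the input arguments used by data_prep.py in the question."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    return parser.parse_args(argv)


def count_rows(path):
    """Open the input like any local CSV file and count its data rows."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)
```

The script never sees the asset name or version; it only receives the path that the job resolves for the --data input.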

References: How to create and register datasets, and How to configure a training run with data input and output
