I was following the SDK v2 Python tutorial to create a pipeline job with my own assets. I noticed that the tutorial has you use a CSV file that can be downloaded, but I'm trying to use a dataset that I already registered myself. The problem I'm facing is that I don't know where I need to specify that dataset.
The funny part is that at the beginning they create this dataset like this:
credit_data = ml_client.data.create_or_update(credit_data)
print(
    f"Dataset with name {credit_data.name} was registered to workspace, the dataset version is {credit_data.version}"
)
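For context, this is roughly how the tutorial builds the credit_data object right before that call (reconstructed from the tutorial, so the exact asset name and description may differ):

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

web_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"

credit_data = Data(
    name="creditcard_defaults",  # the tutorial's asset name; mine is different
    path=web_path,
    type=AssetTypes.URI_FILE,
    description="Dataset for credit card defaults",
)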
But the only other place where they refer to this dataset is at the very end, where they comment out the line that would use it:
registered_model_name = "credit_defaults_model"

# Let's instantiate the pipeline with the parameters of our choice
pipeline = credit_defaults_pipeline(
    # pipeline_job_data_input=credit_data,
    pipeline_job_data_input=Input(type="uri_file", path=web_path),
    pipeline_job_test_train_ratio=0.2,
    pipeline_job_learning_rate=0.25,
    pipeline_job_registered_model_name=registered_model_name,
)
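My assumption is that this commented-out line is exactly where a registered dataset is supposed to go, maybe something like this (just my sketch; the azureml:<name>:<version> path form and the asset name are my guesses):

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Look up the data asset I registered myself (name and version are placeholders)
my_data = ml_client.data.get(name="my_registered_dataset", version="1")

pipeline = credit_defaults_pipeline(
    # Point the pipeline input at the registered asset instead of the web path
    pipeline_job_data_input=Input(
        type=AssetTypes.URI_FILE,
        path=f"azureml:{my_data.name}:{my_data.version}",
    ),
    pipeline_job_test_train_ratio=0.2,
    pipeline_job_learning_rate=0.25,
    pipeline_job_registered_model_name=registered_model_name,
)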
So I believe an already registered dataset can be used here; the problem is that I don't know everywhere the changes are needed (I know data_prep.py and the code below are involved, but I don't know where else) and I don't know how to set this:
%%writefile {data_prep_src_dir}/data_prep.py
...
def main():
    """Main function of the script."""
    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")  # <=== Here, but I don't know how
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--train_data", type=str, help="path to train data")
    parser.add_argument("--test_data", type=str, help="path to test data")
    args = parser.parse_args()
...
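As far as I can tell, data_prep.py itself only ever sees --data as a plain path string, so the actual binding must happen in the component definition (the "code below" I mentioned), which in the tutorial looks roughly like this (reconstructed from memory; the environment name is a placeholder):

from azure.ai.ml import command, Input, Output

data_prep_component = command(
    name="data_prep_credit_defaults",
    display_name="Data preparation for training",
    inputs={
        "data": Input(type="uri_file"),
        "test_train_ratio": Input(type="number"),
    },
    outputs={
        "train_data": Output(type="uri_folder", mode="rw_mount"),
        "test_data": Output(type="uri_folder", mode="rw_mount"),
    },
    code=data_prep_src_dir,
    # The ${{...}} expressions bind the pipeline inputs/outputs to the CLI arguments
    command="python data_prep.py --data ${{inputs.data}} "
    "--test_train_ratio ${{inputs.test_train_ratio}} "
    "--train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}}",
    environment="aml-scikit-learn@latest",  # placeholder environment name
)

So whatever the pipeline passes as pipeline_job_data_input should just show up as a mounted path in args.data, if I understand it correctly.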
Does anyone have experience working with registered datasets?