
I am using the Azure ML Python SDK to build a custom experiment pipeline. I am trying to run training on my tabular dataset in parallel on a cluster of 4 GPU VMs. I am following the documentation at this link: https://learn.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py

The issue I am facing is that no matter what value I set for mini_batch_size, each individual run gets all the rows. I am using EntryScript().logger to check the number of rows passed to each process. What I see is that my data is processed 4 times by 4 VMs rather than being split into 4 parts. I have tried setting mini_batch_size to 1KB, 10KB, and 1MB, but nothing seems to make a difference.
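For reference, the row-count logging in my entry script looks roughly like this (a minimal sketch; on the cluster the logger comes from `EntryScript()` in `azureml_user.parallel_run` and `run()` receives each mini batch as a pandas DataFrame for tabular datasets — the stdlib logger here is just a stand-in so the snippet is self-contained):

```python
import logging

# On the compute cluster you would use:
#   from azureml_user.parallel_run import EntryScript
#   logger = EntryScript().logger
# A plain stdlib logger is used here as a stand-in.
logger = logging.getLogger("batch_process")

def init():
    # One-time setup per worker process (load model, open connections, ...)
    pass

def run(mini_batch):
    # For a tabular dataset input, mini_batch is a pandas DataFrame;
    # len(mini_batch) is the number of rows handed to this call.
    logger.info("rows in this mini-batch: %d", len(mini_batch))
    # output_action="append_row" collects the returned items into one file
    return ["processed %d rows" % len(mini_batch)]
```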

Here is my code for ParallelRunConfig and ParallelRunStep. Any hints are appreciated. Thanks

#------------------------------------------------#
# Step 2a - Batch config for parallel processing #
#------------------------------------------------#
from azureml.pipeline.steps import ParallelRunConfig

# python script step for batch processing
dataprep_source_dir = "./src"
entry_point = "batch_process.py"
mini_batch_size = "1KB"
time_out = 300

parallel_run_config = ParallelRunConfig(
    environment=custom_env,
    entry_script=entry_point,
    source_directory=dataprep_source_dir,
    output_action="append_row",
    mini_batch_size=mini_batch_size,
    error_threshold=1,
    compute_target=compute_target,
    process_count_per_node=1,
    node_count=vm_max_count,
    run_invocation_timeout=time_out
)


#-------------------------------#
# Step 2b - Run Processing Step #
#-------------------------------#
from azureml.core import Datastore
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import ParallelRunStep
from datetime import datetime

# create upload dataset output for processing
output_datastore_name = processed_set_name
output_datastore = Datastore(workspace, output_datastore_name)

processed_output = PipelineData(name="scores", 
                          datastore=output_datastore, 
                          output_path_on_compute="outputs/")

# pipeline step for parallel processing
parallel_step_name = "batch-process-" + datetime.now().strftime("%Y%m%d%H%M")

process_step = ParallelRunStep(
    name=parallel_step_name,
    inputs=[data_input],
    output=processed_output,
    parallel_run_config=parallel_run_config,
    allow_reuse=False
)
zeeshan

2 Answers


I have found the cause of this issue. What the documentation neglects to mention is that mini_batch_size only works if your tabular dataset comprises multiple files, e.g., multiple Parquet files with X rows per file. If you have one gigantic file containing all the rows, mini_batch_size cannot extract partial data from it to be processed in parallel. I solved the problem by configuring my Azure Synapse workspace data pipeline to store only a few rows per file.
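You can also apply the same idea without Synapse: pre-split the one big file into many small ones before registering the dataset, so ParallelRunStep has file boundaries to partition on. A minimal sketch using the stdlib csv module (the `split_csv` name and the `rows_per_file` value are my own; for Parquet you would do the equivalent with pandas or pyarrow):

```python
import csv
import os

def split_csv(src_path, out_dir, rows_per_file=1000):
    """Split one large CSV into many small CSVs, repeating the header in
    each part, so a TabularDataset built from out_dir consists of multiple
    files that ParallelRunStep can partition across workers."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows, part = [], 0
        for row in reader:
            rows.append(row)
            if len(rows) == rows_per_file:
                parts.append(_write_part(out_dir, part, header, rows))
                rows, part = [], part + 1
        if rows:  # remainder smaller than rows_per_file
            parts.append(_write_part(out_dir, part, header, rows))
    return parts

def _write_part(out_dir, part, header, rows):
    path = os.path.join(out_dir, "part_%04d.csv" % part)
    with open(path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

After uploading the split files to the datastore, register the folder (not the single file) as the tabular dataset input.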

zeeshan

It currently works on CSV but not on a single Parquet file. You can batch a CSV file; see e.g. https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb

The documentation does not make it clear that certain file types are treated differently.