I am using the Azure ML Python SDK to build a custom experiment pipeline. I am trying to run training on my tabular dataset in parallel on a cluster of 4 VMs with GPUs, following the documentation at https://learn.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py

The issue I am facing is that no matter what value I set for mini_batch_size, each individual run gets all the rows. I am using EntryScript().logger to check the number of rows passed to each process. What I see is that my data is processed 4 times by the 4 VMs instead of being split into 4 parts. I have tried setting mini_batch_size to 1KB, 10KB, and 1MB, but nothing seems to make a difference.
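For context, here is a simplified sketch of how I count rows in my batch_process.py entry script (the actual training logic is omitted; only the logging that produces the row counts above is shown):

# batch_process.py - simplified sketch; real training logic omitted
from azureml_user.parallel_run import EntryScript

def init():
    logger = EntryScript().logger
    logger.info("init() called on this process")

def run(mini_batch):
    # for a tabular dataset input, mini_batch arrives as a pandas DataFrame
    logger = EntryScript().logger
    logger.info("run() received %d rows", len(mini_batch))
    # ... train / score on mini_batch here ...
    return mini_batch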
Here is my code for ParallelRunConfig and ParallelRunStep. Any hints are appreciated. Thanks.
#------------------------------------------------#
# Step 2a - Batch config for parallel processing #
#------------------------------------------------#
from azureml.pipeline.steps import ParallelRunConfig
# parallel run configuration for batch processing
dataprep_source_dir = "./src"
entry_point = "batch_process.py"
mini_batch_size = "1KB"
time_out = 300
parallel_run_config = ParallelRunConfig(
    environment=custom_env,
    entry_script=entry_point,
    source_directory=dataprep_source_dir,
    output_action="append_row",
    mini_batch_size=mini_batch_size,  # for tabular input: approximate data size per mini batch
    error_threshold=1,
    compute_target=compute_target,
    process_count_per_node=1,
    node_count=vm_max_count,          # 4 VMs in my cluster
    run_invocation_timeout=time_out
)
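For reference, data_input (used in the step below) is my tabular dataset passed as a named input, created roughly like this (the dataset and input names are placeholders):

from azureml.core import Dataset
# "my-tabular-dataset" is a placeholder for my registered dataset name
tabular_ds = Dataset.get_by_name(workspace, name="my-tabular-dataset")
data_input = tabular_ds.as_named_input("batch_input")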
#-------------------------------#
# Step 2b - Run Processing Step #
#-------------------------------#
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.steps import ParallelRunStep
from datetime import datetime
# create the output dataset for the processed results
output_datastore_name = processed_set_name
output_datastore = Datastore(workspace, output_datastore_name)
processed_output = PipelineData(name="scores",
                                datastore=output_datastore,
                                output_path_on_compute="outputs/")
# pipeline step for parallel processing
parallel_step_name = "batch-process-" + datetime.now().strftime("%Y%m%d%H%M")
process_step = ParallelRunStep(
    name=parallel_step_name,
    inputs=[data_input],
    output=processed_output,
    parallel_run_config=parallel_run_config,
    allow_reuse=False
)
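The step is then wrapped in a pipeline and submitted in the usual way, roughly like this (the experiment name is illustrative):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=workspace, steps=[process_step])
# "batch-process-experiment" is an illustrative experiment name
experiment = Experiment(workspace, "batch-process-experiment")
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)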